
Chapter 13: DNA Language Models


What if you found a book written in a language no one had ever seen? No dictionary, no grammar guide, no Rosetta Stone. Just the raw text — millions of pages of it. Could you figure out what it means? Surprisingly, yes — if you have enough text. By noticing which “words” appear near which other “words,” you could discover grammar, syntax, even meaning. The word that always appears between a subject and an object is probably a verb. Words that are interchangeable in context are probably synonyms. This is essentially how modern NLP models cracked human language before anyone handed them a grammar textbook.

Now consider DNA. We have 3 billion letters of text per genome, across thousands of species — but no complete dictionary telling us what every sequence means. We know some words: TATAAA is a core promoter motif, AATAAA is a polyadenylation signal. But 98% of the genome is noncoding, and the regulatory logic written there — which sequences bind which transcription factors, which enhancers activate which genes, which variants disrupt which functions — remains largely unread. Traditional tools approach this with fixed rules and small windows. They can tell you if a 6-mer matches a known motif. They cannot tell you what a sequence means in context.

DNA language models take the approach of the linguist with an unknown text: read billions of sequences, learn the statistical patterns, and use those patterns to predict what each sequence does. No labels required during training. No predetermined grammar. Just the raw text of evolution, accumulated over billions of years of selection, and a model large enough to find the signal within it.

The practical stakes are immediate. A GWAS of autism spectrum disorder yields 15,000 regulatory variants — each a single nucleotide change in noncoding DNA, each potentially disrupting gene regulation in neurons. Traditional conservation scores narrow the list to 3,000. That’s still $1.5 million in experimental validation at $500 per variant. A DNA language model that genuinely understands genomic context can prioritize that list further — not by pattern-matching to known motifs, but by modeling what those sequences are actually saying.

The Biological Challenge

Understanding regulatory variants requires knowing the context in which they appear. A “CAG” sequence means different things in different genomic neighborhoods:

Traditional tools treat each position independently or use fixed-size windows. They can’t capture long-range dependencies, tissue-specific effects, or the combinatorial logic of multiple regulatory elements working together.

The scale of the problem is staggering:

Experimental validation can’t scale to this level. MPRA (Massively Parallel Reporter Assays) can test thousands of sequences, but that’s a tiny fraction of possible variants. And experiments often miss tissue-specific or developmental-stage-specific effects.

What we need is a model that can:

  1. Learn regulatory “grammar” from the entire genome
  2. Understand context across long distances (thousands of bp)
  3. Transfer knowledge across cell types and conditions
  4. Make predictions for variants that have never been tested

This is precisely what language models were designed to do with text. Can we apply the same principles to DNA?

Learning Objectives

After completing this chapter, you will be able to:

13.1 DNA as a Language: More Than a Metaphor

When we call DNA a “language,” we’re not just making a poetic comparison. There are deep structural similarities between human languages and genomic sequences.

The Language Analogy

Consider these parallels:

In English:

In DNA:

For example, if you see “The cat sat on the ___,” you can predict “mat” based on context and grammar. Similarly, if you see a promoter sequence with a TATA box, you can predict nearby nucleotides that form the transcription start site.

Why Single Nucleotides Aren’t Enough

Early attempts to apply language models to DNA treated each nucleotide as a “letter”:

A C G T A A C G G T A C...

But this misses crucial biological structure. Regulatory function emerges from groups of nucleotides:

It’s like trying to understand English by analyzing individual letters without recognizing words. You’d miss that “c-a-t” together means something different from “c-a-r.”

K-mer Tokenization: Finding the Words in DNA

Biological Analogy (DNABERT k-mer tokenization): Like reading DNA as codons, but instead of 3-letter codons, overlapping 6-letter “words” are used to capture meaningful patterns.

K-mers are sequences of k consecutive nucleotides. Instead of reading DNA letter by letter, we read it in chunks:

For k=3 (3-mers or “codons” in the broad sense):

DNA:     A C G T A A C G G T
3-mers:  ACG CGT GTA TAA AAC ACG CGG GGT

For k=6 (6-mers):

DNA:     A C G T A A C G G T A C
6-mers:  ACGTAA CGTAAC GTAACG TAACGG AACGGT ACGGTA CGGTAC

Notice how k-mers overlap—each position starts a new k-mer. This sliding window captures local context.
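The sliding-window tokenization above is a one-liner in Python. A minimal sketch (the function name `kmer_tokenize` is our own, not from any particular library):

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Slide a window of length k along the sequence, advancing one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

kmer_tokenize("ACGTAACGGT", k=3)
# ['ACG', 'CGT', 'GTA', 'TAA', 'AAC', 'ACG', 'CGG', 'GGT']
```

A sequence of length n yields n − k + 1 overlapping k-mers, which is why each nucleotide ends up inside as many as k different tokens.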

Why k-mers work for DNA:

  1. Biological relevance: Many regulatory elements are 6-12bp
  2. Manageable vocabulary: 4^6 = 4,096 possible 6-mers (comparable to common English words)
  3. Context preservation: Overlapping k-mers maintain sequence continuity
  4. Flexibility: Can adjust k based on biological scale of interest

The Vocabulary Size Problem

The choice of k is a trade-off:

| K value | Vocabulary size | Biological relevance | Computational cost |
|---------|-----------------|----------------------|--------------------|
| 3 | 64 | Too short for most motifs | Very low |
| 6 | 4,096 | Captures many TF sites | Moderate |
| 9 | 262,144 | Rare k-mers, sparse data | High |
| 12 | ~16.8 million | Most k-mers never seen | Prohibitive |
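The vocabulary column follows directly from 4^k — each additional base multiplies the vocabulary by four. A quick check:

```python
# Vocabulary size grows exponentially in k: 4**k possible k-mers
for k in (3, 6, 9, 12):
    print(f"k={k:2d}  vocabulary={4 ** k:,}")
# k= 3  vocabulary=64
# k= 6  vocabulary=4,096
# k= 9  vocabulary=262,144
# k=12  vocabulary=16,777,216
```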

DNABERT was released as four separate models, one each for k=3, 4, 5, and 6, capturing patterns at different scales—like understanding text through letters, syllables, and words.

13.2 DNABERT: Bidirectional Encoder for DNA

DNABERT (2021) was the first major application of the BERT architecture to DNA sequences. Let’s understand how it works.

The BERT Architecture for DNA

Recall from Chapter 10 that BERT uses:

  1. Bidirectional context: Looks at sequence on both sides of each position
  2. Masked language modeling: Predicts hidden tokens from context
  3. Transformer layers: Self-attention to weight important context
  4. Pre-training then fine-tuning: Learn general patterns, then specialize

DNABERT adapts this to genomic sequences:

Input DNA:     A C G [MASK] A A C G G T
3-mer tokens:  ACG [MASK] [MASK] [MASK] AAC ACG CGG GGT
Position:       1    2      3      4     5   6   7   8

Transformer processes all positions
↓
Predict masked tokens: "CGT", "GTA", "TAA"

Because k-mers overlap, hiding one nucleotide means masking every k-mer that contains it—otherwise the flanking tokens would simply reveal the answer.

Pre-training DNABERT: Learning Genomic Grammar

DNABERT is pre-trained on the entire human reference genome (hg38)—all 3.2 billion nucleotides. The training process:

Step 1: Convert genome to k-mers

Genome region: ACGTAACGGT...
6-mers:        ACGTAA, CGTAAC, GTAACG, ...

Step 2: Random masking (15% of k-mers)

Original:  ACGTAA CGTAAC GTAACG TAACGG AACGGT
Masked:    ACGTAA [MASK] GTAACG TAACGG [MASK]

Step 3: Model predicts masked k-mers The model must reconstruct the original sequence using bidirectional context.

Step 4: Update weights to minimize prediction error
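Steps 2 and 3 can be sketched in a few lines. This is a simplified version (real DNABERT masks contiguous runs of k tokens so that overlapping neighbors cannot leak the masked nucleotide); `mask_tokens` is our own illustrative helper:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace ~mask_rate of k-mer tokens with [MASK]; return masked tokens and targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)   # the model is scored on recovering these
        else:
            masked.append(tok)
            targets.append(None)  # unmasked positions contribute no loss
    return masked, targets
```

During training, the loss is computed only at the masked positions; gradient descent on that loss is Step 4.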

After seeing billions of examples, DNABERT learns:

What DNABERT Learns: Hidden Knowledge

After pre-training, DNABERT’s internal representations (embeddings) capture biological information without being explicitly told:

Experiment: Ji et al. (2021) analyzed DNABERT embeddings:

  1. Promoter sequences cluster together (similar embeddings)
  2. Splice sites form distinct clusters
  3. Repetitive elements group separately
  4. Conserved motifs have similar representations across contexts

This is remarkable: DNABERT was never told “this is a promoter” or “this is a splice site.” It discovered these patterns just by learning to predict masked nucleotides.

Fine-tuning for Specific Tasks

After pre-training, you can fine-tune DNABERT for specific biological tasks:

Task 1: Promoter identification

Task 2: Transcription factor binding site prediction

Task 3: Variant effect prediction

The key advantage: Pre-training on the entire genome provides a strong foundation. Fine-tuning requires relatively little task-specific data.

13.3 Beyond DNABERT: Next-Generation DNA Language Models

DNABERT showed the potential of language models for genomics, but had limitations. Several next-generation models address these issues.

DNABERT-2: Efficient Training at Scale

DNABERT-2 (2023) made several improvements:

1. Byte Pair Encoding (BPE) for tokenization

Instead of fixed k-mers, BPE learns optimal “words” from the data:

Common sequence:        ACGTAA ACGTAA ACGTAA (appears often)
BPE learns:            ACGTAA is a single token (not 6 separate)

Rare sequence:         GGGGGG (appears rarely)  
BPE keeps separate:    GG GG GG (or G G G G G G)
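A toy version of one BPE merge step shows the mechanism: count adjacent token pairs, fuse the most frequent pair, repeat. Run many times, frequent strings end up as single tokens while rare ones stay fragmented (`bpe_merge_step` is our own sketch, not DNABERT-2’s actual implementation):

```python
from collections import Counter

def bpe_merge_step(corpus):
    """Fuse the most frequent adjacent token pair across the corpus (one BPE step)."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    if not pairs:
        return corpus, None
    (a, b), _ = pairs.most_common(1)[0]
    merged_corpus = []
    for tokens in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)   # replace the pair with one fused token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus, a + b

corpus = [list("ACGTAA"), list("ACGTAA"), list("GGGG")]
corpus, new_token = bpe_merge_step(corpus)  # fuses the single most common pair
```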

Benefits:

2. More efficient context use

DNABERT used fixed overlapping k-mers, so a 512-token input covered a fixed nucleotide span. DNABERT-2 uses BPE tokens of variable length, so the nucleotide span depends on the sequence and tokenization. In practice, this can cover several kilobases rather than a fixed 512 bp window, but it should not be described as a universal 10 kb receptive field.

This captures:

3. Multi-species pre-training

DNABERT-2 trains on genomes from multiple species:

This helps the model learn:

Performance improvements:

Nucleotide Transformer: Scaling Up

Nucleotide Transformer (2023) takes a different approach: scale up model size and training data.

Architecture:

Key insight: Larger models trained on more diverse data capture more nuanced patterns.

Novel feature: Cross-species embeddings

Because it trains on many species, Nucleotide Transformer can:

Example:

Human enhancer:     ACGTAAGGCTAG...
Mouse ortholog:     ACTTAAGGCCAG... (60% identity)
Zebrafish element:  GCGTAAGGCTGC... (45% identity)

Nucleotide Transformer embeddings show these sequences are functionally similar
despite sequence divergence
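In practice, “similar embeddings” is usually quantified with cosine similarity between the models’ sequence vectors. A minimal sketch (the toy 3-d vectors stand in for real, much higher-dimensional model embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors; 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of orthologous regulatory elements
human_emb = np.array([0.9, 0.1, 0.4])
mouse_emb = np.array([0.8, 0.2, 0.5])
similarity = cosine_similarity(human_emb, mouse_emb)  # close to 1.0 here
```

High embedding similarity between elements with low sequence identity is the signature of functional conservation the model has learned.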

LOGO: Language of Genomes in One

LOGO (2024) addresses a fundamental limitation: previous models treat all genomic regions equally.

The problem:

LOGO’s solution: Multi-task pre-training

During pre-training, LOGO simultaneously learns:

  1. Masked language modeling (predict hidden nucleotides)
  2. Region type classification (promoter, enhancer, exon, etc.)
  3. Chromatin state prediction (active, repressed, etc.)
  4. Conservation scoring

Architecture:

Input sequence → LOGO encoder
                      ↓
         ┌────────────┼────────────┐
         ↓            ↓            ↓
    MLM head    Region head   Chromatin head
         ↓            ↓            ↓
   Predict k-mer  Promoter?   H3K27ac?

Advantages:

13.4 GROVER: Learning a Genomic Vocabulary from Sequence

GROVER (Genome Rules Obtained Via Extracted Representations, 2024) takes yet another approach: instead of starting from fixed k-mers, it learns a frequency-balanced vocabulary for the human genome using byte-pair encoding (BPE).

The Context Problem

Many previous models had a fixed token window:

But biological context works at multiple scales:

GROVER’s Sequence-Only Strategy

GROVER trains a BERT-style model on human genome sequence using a BPE vocabulary selected by next-k-mer prediction. The key idea is that a useful DNA vocabulary should not simply list every possible 6-mer; it should group common sequence patterns while still preserving informative rare patterns.

What the paper reports:

What GROVER does not do in its core pretraining setup:

Example application: Variant effect in 3D context

Variant at position X in enhancer
    ↓
GROVER embedding of enhancer
    ↓
Compare to embeddings of potential target promoters
    ↓
Prioritize hypotheses about which regulatory grammar changed

To make this a 3D genome analysis, GROVER embeddings would need to be combined with external chromatin-contact or perturbation data.

[Optional: The Math]

Math Box: Attention Mechanisms in DNA Language Models

All modern DNA language models use attention mechanisms to weigh important context. Let’s break down how this works.

Self-Attention for Sequence Context

Given a sequence of k-mer embeddings, attention computes how much each position should “attend to” every other position.

Input: Sequence embeddings

Position:    1      2      3      4      5
K-mer:     ACGT   CGTA   GTAA   TAAC   AACG
Embedding:  e₁     e₂     e₃     e₄     e₅

For each position i, compute attention to every position j:

Step 1: Create Query, Key, Value matrices

Query₁ = Wq × e₁
Key₂ = Wk × e₂  
Value₂ = Wv × e₂

Step 2: Compute attention scores

score₁,₂ = Query₁ · Key₂ / √d

Where d is the dimension of the query/key vectors (in multi-head models, the per-head dimension — e.g., 768/12 = 64). Division by √d prevents the dot products from growing too large before the softmax.

Step 3: Apply softmax to get attention weights

attention₁,₂ = exp(score₁,₂) / Σⱼ exp(score₁,ⱼ)

This normalizes attention across all positions (sums to 1).

Step 4: Weighted sum of values

output₁ = Σⱼ attention₁,ⱼ × Valueⱼ
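The four steps combine into a few lines of NumPy. This is a single-head sketch with random weights, not a trained model:

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """Scaled dot-product self-attention over embeddings E of shape (n, d)."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv            # Step 1: queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])      # Step 2: scaled dot products
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = w / w.sum(axis=1, keepdims=True)  # Step 3: row-wise softmax
    return weights @ V                          # Step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d = 5, 8                                     # 5 k-mer positions, 8-dim embeddings
E = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(E, Wq, Wk, Wv)             # shape (5, 8)
```

Subtracting the row maximum before exponentiating is the standard numerically stable softmax; it does not change the resulting weights.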

Biological interpretation:

Example: Splice site prediction

For a sequence near a splice donor site:

Position:  ...EXON | GT | INTRON...

Attention pattern shows:
- GT dinucleotide attends to upstream exonic sequence
- GT attends to downstream intronic elements
- Less attention to distant positions

This matches biological reality: splice site recognition depends on nearby exonic/intronic context.

Multi-Head Attention

DNA language models use multiple attention heads (typically 12):

Head 1 might learn: TF binding motifs
Head 2 might learn: GC content patterns
Head 3 might learn: Repeat structures
Head 12 might learn: Conservation patterns

Each head can specialize in different types of patterns. The model combines all heads to get a rich representation.

Mathematical formulation:

MultiHead(Q,K,V) = Concat(head₁, head₂, ..., headₕ) × Wo

where headᵢ = Attention(QWᵢq, KWᵢk, VWᵢv)
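The same formulation in self-contained NumPy, with 3 heads of 4 dimensions each over a 12-dimensional model (random weights; the shapes are chosen purely for illustration):

```python
import numpy as np

def attention(E, Wq, Wk, Wv):
    """Scaled dot-product attention for a single head."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    s = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

def multi_head(E, head_weights, Wo):
    """Concat(head_1, ..., head_h) @ Wo."""
    heads = [attention(E, Wq, Wk, Wv) for Wq, Wk, Wv in head_weights]
    return np.concatenate(heads, axis=1) @ Wo

rng = np.random.default_rng(1)
n, d_model, h = 5, 12, 3
d_head = d_model // h  # each head works in its own 4-dim subspace
head_weights = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
                for _ in range(h)]
Wo = rng.normal(size=(h * d_head, d_model))
E = rng.normal(size=(n, d_model))
out = multi_head(E, head_weights, Wo)  # shape (5, 12)
```

Because each head projects into its own low-dimensional subspace, the heads are free to specialize before the output projection Wo mixes them back together.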

13.5 Comparing DNA Language Models

Let’s compare the major DNA language models:

| Model | Year | Scale | Training data | Context | Key feature |
|-------|------|-------|---------------|---------|-------------|
| DNABERT | 2021 | BERT-scale | Human genome | Fixed k-mer window | Early DNA BERT model |
| DNABERT-2 | 2023 | 117M reported | Multi-species genomes | Variable BPE span | Efficient BPE tokenization |
| Nucleotide Transformer | 2023/2024 | Up to billions of parameters | Many genomes across species | Varies by checkpoint | Scaling and cross-species training |
| LOGO | 2024 | Model-specific | Sequence plus functional annotations | Task-dependent | Multi-task sequence learning |
| GROVER | 2024 | 12-layer BERT-style | Human genome sequence | BPE token window | Frequency-balanced genomic vocabulary |

Performance on Standard Tasks

There is no single leaderboard that ranks all DNA language models across all genomic tasks. Performance depends on:

General trends:

Computational Requirements

Training these models from scratch is expensive:

| Model type | Pretraining cost | Fine-tuning cost | Inference cost |
|------------|------------------|------------------|----------------|
| Small BERT-style DNA model | Moderate | Low to moderate | Low |
| Efficient BPE model | Moderate | Low to moderate | Low |
| Billion-parameter model | Very high | Moderate to high | Moderate |
| Long-context model | High, especially for long windows | Task-dependent | Moderate to high |

Good news: You don’t need to train from scratch! Pre-trained models are publicly available. You only pay for fine-tuning on your specific task.

Storage requirements:

13.6 Using DNA Language Models in Practice

Let’s walk through how you’d actually use these models for real biological questions.

Step 1: Choose Your Model

Decision tree:

Need maximum accuracy? → Nucleotide Transformer (but slower)

Need speed and efficiency? → DNABERT-2

Working with non-coding variants? → LOGO or GROVER

Limited computational resources? → DNABERT

Multiple species analysis? → Nucleotide Transformer or DNABERT-2

Step 2: Prepare Your Sequences

DNA language models expect specific input formats:

# Example: Preparing sequences for DNABERT-2
from transformers import AutoTokenizer

# Load the pre-trained tokenizer (DNABERT-2 ships custom code,
# so trust_remote_code=True is required)
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

# Your sequence
sequence = "ACGTAACGGTACGTA"

# Tokenize
tokens = tokenizer(sequence, return_tensors="pt")

# Now ready for model input

Key considerations:

Step 3: Extract Embeddings or Make Predictions

Option A: Get embeddings for downstream analysis

from transformers import AutoModel

# Load the pre-trained model
model = AutoModel.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

# Get embeddings (this checkpoint's custom forward returns
# the hidden states as the first element of a tuple)
embeddings = model(**tokens)[0]

# Shape: [batch_size, sequence_length, embedding_dim]
# embedding_dim is 768 for this checkpoint

Option B: Fine-tune for specific prediction

from transformers import AutoModelForSequenceClassification

# Load model with a classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "zhihan1996/DNABERT-2-117M",
    num_labels=2,  # binary classification
    trust_remote_code=True,
)

# Fine-tune on your labeled data
# (training loop code here)

# Make predictions on new sequences
predictions = model(**tokens)

Step 4: Interpret Results

For embeddings:

For predictions:

Case Study 13.1: Teaching Scenario - Prioritizing Noncoding Variants in Autism

Background: Genome-wide association studies (GWAS) of autism spectrum disorder identified 102 genomic regions associated with the condition. But each region contains dozens to hundreds of variants. Which ones are actually functional?

Important note: The workflow below is a teaching scenario showing how a DNA language model could be used. The specific validation rates, allele frequencies, and variant sequence are illustrative unless replaced with a primary study citation.

Approach:

Researchers used DNABERT-2 to prioritize candidate variants:

Step 1: Fine-tune on brain regulatory data

Step 2: Predict variant effects

Step 3: Validate predictions

Key finding: One variant in an enhancer near CHD8 (a known autism-associated gene):

Reference:  ...ACGTAACG[G]TACGTA...  (predicted: 0.82 active enhancer)
Alternate:  ...ACGTAACG[A]TACGTA...  (predicted: 0.23 active enhancer)

This single nucleotide change disrupts a predicted CTCF binding site, reducing enhancer activity by 60% in MPRA validation.
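Computationally, this kind of variant effect prediction is a paired comparison: score the reference window, score the same window with the alternate allele, and take the difference. A minimal helper for building the matched windows (`variant_windows` is our own illustrative function; the resulting pair would be fed to a fine-tuned classifier):

```python
def variant_windows(genome: str, pos: int, ref: str, alt: str, flank: int = 7):
    """Return (reference_window, alternate_window) around a SNV at 0-based pos.

    Assumes the flank fits inside the sequence; real pipelines clip at
    chromosome boundaries.
    """
    assert genome[pos] == ref, "reference allele does not match the genome"
    left = genome[pos - flank:pos]
    right = genome[pos + 1:pos + 1 + flank]
    return left + ref + right, left + alt + right

ref_w, alt_w = variant_windows("ACGTAACGGTACGTA", pos=8, ref="G", alt="A", flank=4)
# ref_w = "AACGGTACG", alt_w = "AACGATACG": identical except at the variant base
```

Scoring both windows with the fine-tuned model and subtracting gives the delta score used to rank candidate variants.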

Clinical significance: In a real study, this final step would require independent genetic association, segregation or de novo evidence where appropriate, phenotype matching, and functional validation. DNABERT-2-style prioritization can nominate candidates, but it cannot establish clinical causality on its own.

Case Study 13.2: Pan-Species Regulatory Element Discovery

Background: Many species lack extensive epigenomic data. Can we use language models trained on well-studied species to predict regulatory elements in unstudied species?

Approach:

Researchers can use Nucleotide Transformer-style embeddings to predict regulatory elements across species with limited annotations:

Step 1: Train on multi-species data

Step 2: Test on species with limited data

Step 3: Make predictions

Illustrative results:

Zebrafish predictions:

Cross-species conservation without alignment:

Human enhancer:        ACGTAAGGCT...
Mouse ortholog:        ACTTAAGGCC... (65% identity)
Zebrafish prediction:  GCGTAAGGCA... (52% identity, no detectable alignment)

All three have similar Nucleotide Transformer embeddings
→ Functionally equivalent despite sequence divergence

Impact: This approach illustrates why multi-species pretraining is useful: embeddings can nominate candidate regulatory elements in species where epigenomic assays are sparse. Actual discovery claims should cite the specific benchmark or experimental validation study.

13.7 Limitations and Challenges

Despite impressive performance, DNA language models have important limitations.

Challenge 1: Context Length Limitations

Most models still can’t capture very long-range interactions:

Current limits:

Biological reality:

Challenge 2: Cell Type Specificity

DNA language models learn from DNA sequence alone (mostly). But:

Example:

Sequence X in embryonic stem cells: Active enhancer
Same sequence in liver cells: Inactive/repressed

DNA language model only sees sequence → Can't distinguish

Partial solutions:

Challenge 3: Structural Variants

Current models work well for SNVs (single nucleotide variants) but struggle with:

Why?

Challenge 4: Interpretation

DNA language models are still “black boxes”:

Attention visualization helps but is limited:

High attention between positions 45 and 67
→ But what does this mean biologically?
→ Which proteins bind there?
→ How does this affect gene expression?

Challenge 5: Training Data Bias

Models learn patterns from training data:

Consequence:

13.8 Future Directions

Where are DNA language models heading?

1. Longer Context Windows

Approaches in development:

Goal: Handle entire chromosomes (100+ million bp) in a single model

2. Multi-Modal Integration

Combining DNA sequence with other data:

Example architecture:

DNA sequence → Language model → Embeddings
                                     ↓
H3K27ac signal → CNN → Embeddings ─→ Fusion → Prediction
                                     ↑
ATAC-seq data → CNN → Embeddings ─→

3. Foundation Models for Genomics

Building truly general-purpose models:

This is the approach of recent models like:

4. Evolutionary Understanding

Models that explicitly learn evolutionary constraints:

5. Generative Models

Current models are discriminative (classify/predict). Future models may be generative:

Potential application:

Input: "Design an enhancer active in T cells but not B cells"
Model generates: Novel sequence meeting these criteria
Validate in MPRA

Summary

Key Takeaways:

📖 Key Terms

| Term | Definition |
|------|------------|
| **Attention mechanism** | Method for the model to weight important sequence context, computing how much each position should "attend to" every other position |
| **BERT (Bidirectional Encoder Representations from Transformers)** | Model architecture that processes sequences in both directions simultaneously |
| **Byte Pair Encoding (BPE)** | Tokenization method that learns optimal "words" from data based on frequency |
| **Context window** | Maximum input span a model can process at once; depending on tokenization, this may be measured in tokens rather than directly in base pairs |
| **Embedding** | Numerical vector representation of a sequence that captures its biological properties |
| **Fine-tuning** | Adapting a pre-trained model to a specific task with limited task-specific data |
| **Foundation model** | Large pre-trained model that can be adapted to many downstream tasks |
| **K-mer** | Sequence of k consecutive nucleotides used as a token (e.g., 6-mer = ACGTAA) |
| **Masked language modeling (MLM)** | Training objective where the model predicts randomly hidden tokens from surrounding context |
| **Multi-head attention** | Using multiple attention mechanisms in parallel, each learning different types of patterns |
| **Pre-training** | Training a model on large unlabeled datasets to learn general patterns before fine-tuning |
| **Self-attention** | Mechanism allowing each position in a sequence to attend to all other positions |
| **Tokenization** | Converting raw DNA sequence into discrete units (tokens) for model input |
| **Transfer learning** | Using knowledge learned from one task (pre-training) to improve performance on another task (fine-tuning) |
| **Zero-shot learning** | Making predictions on new tasks without any task-specific training examples |

Conceptual Questions

  1. Why is k-mer tokenization more appropriate for DNA than single-nucleotide tokenization? What biological properties make k-mers useful units?

  2. DNABERT learns that promoter sequences cluster together (have similar embeddings) without being explicitly told which sequences are promoters. Explain how masked language modeling enables this unsupervised learning of biological function.

  3. Compare the trade-offs between DNABERT (smaller, human-genome-focused model) and Nucleotide Transformer (larger, multi-species model family). In what scenarios would you choose each?

  4. A researcher wants to predict the effect of a variant in an enhancer 500kb from its target gene. Which DNA language model(s) from this chapter would be most appropriate, and why? What are the limitations?

  5. Explain why DNA language models generally outperform conservation-based methods (like GERP++) for variant effect prediction, even though conservation scores use evolutionary information across species.

  6. If you fine-tune DNABERT on enhancer data from liver cells, can you use it to predict enhancers in brain cells? Why or why not? What additional information would help?

  7. Attention weights in DNA language models often show high attention between positions that are far apart in sequence. What biological interactions might this represent? Give specific examples.

  8. LOGO uses multi-task pre-training (masked language modeling + region classification + chromatin state prediction). Why does learning multiple tasks simultaneously improve performance compared to learning each task separately?


Discussion Questions

  1. Ethical considerations: DNA language models can predict regulatory effects of variants, but these predictions aren’t perfect. How should we handle cases where a model predicts high functional impact for a variant, but clinical geneticists are uncertain? Who should make the final decision about variant interpretation?

  2. Data representation: These models are trained primarily on human reference genomes and well-studied populations. How might this bias affect predictions for variants common in under-represented populations? What steps could be taken to address this?

  3. Mechanistic understanding vs. prediction accuracy: DNA language models can achieve high accuracy without explaining how a variant causes its effect. Is this acceptable for clinical use? When is mechanistic understanding essential vs. when is accurate prediction sufficient?

  4. Resource allocation: Training large DNA language models can require substantial GPU infrastructure and engineering time. Is this a good use of research funding compared to funding experimental validation studies? How should the field balance computational vs. experimental approaches?

  5. Generalization limits: Current models work well for SNVs in regulatory regions but struggle with structural variants and coding sequences. Should we develop specialized models for each variant type and genomic context, or pursue a single “universal” model? What are the trade-offs?

Further Reading

Foundational Papers

DNABERT (2021) Ji, Y., Zhou, Z., Liu, H., & Davuluri, R. V. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15), 2112-2120.

Nucleotide Transformer (2023) Dalla-Torre, H., Gonzalez, L., Mendoza-Revilla, J., et al. (2024). The Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods.

DNABERT-2 (2023) Zhou, Z., Ji, Y., Li, W., et al. (2023). DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. arXiv preprint.

GROVER (2024) Sanabria, M., Hirsch, J., Joubert, P. M., et al. (2024). DNA language model GROVER learns sequence context in the human genome. Nature Machine Intelligence.

Recent Reviews

DNA Language Models for Variant Effects (2023) Benegas, G., Batra, S. S., & Song, Y. S. (2023). DNA language models are powerful predictors of genome-wide variant effects. PNAS, 120(44), e2311219120.

Language Models for Biological Research (2024) Simon, E., Swanson, K., & Zou, J. (2024). Language models for biological research: a primer. Nature Methods, 21, 1422-1429.

Online Resources

Hugging Face Model Hub - Genomics Models https://huggingface.co/models?pipeline_tag=feature-extraction&search=dna

Nucleotide Transformer GitHub https://github.com/instadeepai/nucleotide-transformer

DNABERT Documentation https://github.com/jerryji1993/DNABERT

Textbook Chapters

Deep Learning for Life Sciences Ramsundar, B., Eastman, P., Walters, P., & Pande, V. (2019). Chapter 8: Language Models. In Deep Learning for the Life Sciences. O’Reilly Media.

Biological Sequence Analysis Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

What’s Next?

In Chapter 14: Next-Generation DNA Models, we’ll explore even more advanced architectures that push beyond the transformer paradigm:

These models address key limitations of current transformers:

Prerequisites for Chapter 14:

Coming up: We’ll see how alternative architectures can capture chromosome-scale context while remaining computationally tractable—opening new possibilities for understanding long-range gene regulation and structural variant effects.

[Continue to Chapter 14: Next-Generation DNA Models →]


This chapter is part of “AI for Biologists: From Genomic Variants to Cellular Models”
Licensed under CC BY-NC-SA 4.0