ai-for-genomic-science

Chapter 5: Genetic Variation and Genomic Technologies


4,100,000.

That is approximately how many positions in your DNA differ from the person sitting next to you in class right now. Not a vague “a lot,” but a number you could, in principle, count one position at a time. If you printed those differences on paper, one variant per line at 50 lines per page, you would fill 82,000 pages. Stack those pages and the pile would be taller than you are.

Somewhere in that stack — perhaps on page 47,231 or page 81,003 — there might be a single line that explains why your family has a history of early-onset heart disease, or why you can taste bitter compounds in broccoli that your roommate cannot, or why a drug that works for most people gives you a side effect that mystifies your physician. One line in 82,000 pages. And that is assuming the answer lies in a single variant. Many traits and diseases arise from combinations: two variants that are harmless alone but disruptive together, or a common variant that only matters in a particular environmental context.

For most of human history, this stack was invisible. We knew that heredity existed, that traits ran in families, that identical twins were more similar than fraternal ones — but the molecular substrate was opaque. The sequencing revolution of the last two decades cracked it open. We can now read every letter of a person’s genome in a matter of days, for a cost that continues to fall. The stack has become legible. The problem, now, is navigation.

This chapter is about the technologies that generate the stack, the types of variants it contains, and the computational strategies that make it navigable. It is the problem that motivates everything that follows in this book — because before AI can help us interpret variation, we need to understand what variation is, how we measure it, and why the sheer scale of it demands something smarter than a manual search.


The Biological Challenge: Scale, Complexity, and the Noncoding Genome

Before we can apply AI to interpret genetic variants, we need to understand what we’re dealing with. The human genome presents several interconnected challenges:

Challenge 1: Sheer Volume

Each human genome contains approximately 4-5 million variants compared to the reference genome (GRCh38). These include:

If you examined one variant per minute, working 24/7, it would take over 7 years to review a single genome. We’re now sequencing hundreds of thousands of genomes per year. Manual review is impossible.
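
The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch: the function name `review_years` and the one-minute-per-variant pace are illustrative assumptions, not part of any real review pipeline.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes in a non-leap year


def review_years(n_variants: int, minutes_per_variant: float = 1.0) -> float:
    """Years of nonstop (24/7) review at a fixed time per variant."""
    return n_variants * minutes_per_variant / MINUTES_PER_YEAR


# The chapter's range of 4-5 million variants per genome
for n in (4_000_000, 5_000_000):
    print(f"{n:,} variants -> {review_years(n):.1f} years")
```

Even the low end of the range works out to more than seven years for a single genome.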

Challenge 2: Most of the Genome Isn’t Genes

Only about 1.5% of the human genome codes for proteins (exons). Yet variants outside protein-coding regions can have profound functional impacts:

A variant in an enhancer 500 kilobases upstream of a critical developmental gene can be just as impactful as a variant that changes an amino acid in that gene’s protein. But we have far less understanding of how noncoding variants work.

Challenge 3: Context Matters

The same DNA sequence can have completely different functions depending on:

A variant that’s harmless in most people might have functional impact when combined with other variants. We call these epistatic interactions, and they are thought to be common.

Challenge 4: Limited Experimental Capacity

How do we figure out what variants actually do? The gold standard is experimental testing:

But this approach costs $10,000-$50,000 per variant and takes weeks to months. With 4-5 million variants per genome, experimental validation of everything is neither practical nor affordable.

We need computational predictions to prioritize which variants are worth experimental follow-up.
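
To see why testing everything is out of reach, the per-variant cost range quoted above can simply be multiplied out. The helper name `validation_cost_range` is hypothetical, and the calculation ignores any economies of scale, so treat it as a rough illustration.

```python
def validation_cost_range(n_variants: int,
                          low_per_variant: int = 10_000,
                          high_per_variant: int = 50_000) -> tuple[int, int]:
    """Total cost (USD) of testing every variant at the quoted per-variant prices."""
    return n_variants * low_per_variant, n_variants * high_per_variant


low, high = validation_cost_range(4_000_000)
print(f"${low:,} to ${high:,} for one genome")
```

At the low end of 4 million variants, the total runs to tens of billions of dollars for one genome.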


Learning Objectives

By the end of this chapter, you will be able to:


5.1 Types of Genetic Variation

Human genetic diversity is beautiful and complex. Let’s start with the major categories of variation and what they mean biologically.

5.1.1 Single Nucleotide Variants (SNVs)

The most common type of genetic variation is the single nucleotide variant (SNV)—one DNA base differs from the reference genome.

Example:

Reference genome:  ...ATCG[A]GTCC...
Individual's DNA:  ...ATCG[G]GTCC...
                         ^
                     SNV: A→G
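
The comparison in the diagram can be written as a small function. This is a teaching sketch that assumes the two sequences are already aligned and of equal length; real variant callers work from many aligned sequencing reads rather than two clean strings, and `find_snvs` is a hypothetical name.

```python
def find_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length (no indels handled here)")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# The example from the text: A→G at the fifth base (index 4)
print(find_snvs("ATCGAGTCC", "ATCGGGTCC"))  # [(4, 'A', 'G')]
```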

Key facts about SNVs:

Functional impact depends on location:

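In protein-coding regions, SNVs can be synonymous (the DNA changes but the amino acid does not), missense (one amino acid is changed to another), or nonsense (a premature stop codon is created, typically truncating the protein).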

In noncoding regions, SNVs can affect:

5.1.2 Insertions and Deletions (Indels)

Indels are small insertions or deletions, typically 1-50 base pairs.

Example of a 2 bp deletion:

Reference:  ...ATCGAA[GT]CCTA...
Individual: ...ATCGAA[--]CCTA...
                      deletion
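
Whether an indel in a coding region disrupts the reading frame depends on whether its length is a multiple of 3. The sketch below checks this for alleles written in the common "anchor base plus change" style (for the deletion above, reference `AGT` and alternate `A`). The function name `indel_effect` is hypothetical, and the frame label is only meaningful if the indel actually falls in coding sequence.

```python
def indel_effect(ref_allele: str, alt_allele: str) -> str:
    """Describe an indel by size, direction, and effect on the reading frame."""
    length_change = len(alt_allele) - len(ref_allele)
    if length_change == 0:
        return "not an indel"
    kind = "insertion" if length_change > 0 else "deletion"
    # A length change divisible by 3 adds or removes whole codons
    frame = "in-frame" if length_change % 3 == 0 else "frameshift"
    return f"{abs(length_change)} bp {kind}, {frame}"


print(indel_effect("AGT", "A"))    # 2 bp deletion, frameshift
print(indel_effect("A", "ACGT"))   # 3 bp insertion, in-frame
```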

Key facts about indels:

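In protein-coding regions, an indel whose length is not a multiple of 3 base pairs causes a frameshift, which shifts the reading frame and alters all downstream amino acids.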

In noncoding regions:

5.1.3 Structural Variants (SVs)

Structural variants are large-scale genomic changes, typically >50 bp, often much larger (kilobases to megabases).

Types of SVs:

  1. Deletions: Large segments removed
    • Can delete entire genes or regulatory regions
    • Example: 22q11.2 deletion syndrome (DiGeorge syndrome)—3 megabase deletion affecting ~90 genes
  2. Duplications: Segments copied
    • Gene dosage effects (too much of a protein)
    • Example: PMP22 duplication causes Charcot-Marie-Tooth neuropathy
  3. Inversions: Segment flipped orientation
    • Can disrupt genes at breakpoints
    • Can separate genes from regulatory elements
  4. Translocations: Segments moved between chromosomes
    • Classic example: BCR-ABL fusion in chronic myeloid leukemia
  5. Copy number variants (CNVs): Deletions or duplications that change how many copies of a segment are present (this category overlaps with 1 and 2)
    • ~5,000-10,000 per genome
    • Can affect gene dosage
    • Some are common and benign, others cause disorders
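
The size boundaries used in this chapter (SNVs change one base, indels span 1-50 bp, structural variants exceed 50 bp) can be expressed as a simple classifier. This is a sketch based only on allele lengths: `classify_variant` is a hypothetical helper, and it cannot recognize inversions or translocations, which do not change length.

```python
def classify_variant(ref_allele: str, alt_allele: str, sv_threshold: int = 50) -> str:
    """Classify a variant by size, using the chapter's 50 bp indel/SV boundary."""
    if len(ref_allele) == 1 and len(alt_allele) == 1:
        return "SNV"
    size = abs(len(alt_allele) - len(ref_allele))
    if size == 0:
        return "other"  # same-length multi-base change, outside these categories
    return "indel" if size <= sv_threshold else "structural variant"


print(classify_variant("A", "G"))             # SNV
print(classify_variant("AGT", "A"))           # indel
print(classify_variant("A", "A" + "C" * 51))  # structural variant
```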

Detection challenges:

5.1.4 Variant Allele Frequency and Functional Impact

Not all variants are created equal. One key predictor of functional impact is how common a variant is in the population.

The logic:

Allele frequency categories:

Important caveat: This is a statistical tendency, not a rule. Some common variants do have functional impact (e.g., the sickle cell allele, which is common in malaria-endemic regions because carriers of one copy are protected against malaria). And most rare variants are still neutral; they’re rare simply because they arose recently.
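
Allele frequency itself is a simple count. The sketch below assumes a diploid, autosomal site, so every individual contributes two alleles; the function name and the genotype counts are made up for illustration.

```python
def allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Frequency of the alternate allele from diploid genotype counts."""
    n_individuals = n_hom_ref + n_het + n_hom_alt
    if n_individuals == 0:
        raise ValueError("no genotyped individuals")
    total_alleles = 2 * n_individuals      # two copies per person
    alt_alleles = n_het + 2 * n_hom_alt    # heterozygotes carry one, homozygotes two
    return alt_alleles / total_alleles


# Hypothetical cohort of 10,000 people
af = allele_frequency(n_hom_ref=9_800, n_het=198, n_hom_alt=2)
print(f"allele frequency = {af:.4f}")  # 0.0101
```

Note that the denominator is the number of allele copies, not the number of people, which is why a variant carried by about 2% of individuals here has a frequency near 1%.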


5.2 Genome Sequencing Technologies

Understanding genetic variation requires the ability to accurately read DNA sequences. Let’s explore the major technologies and their trade-offs.

5.2.1 From Microarrays to Whole-Genome Sequencing

DNA Microarrays (SNP chips)

Whole-Exome Sequencing (WES)

Whole-Genome Sequencing (WGS)

The trend: WGS costs are dropping rapidly. As of 2025, the cost of WGS is approaching that of WES, making WGS increasingly the preferred method.

5.2.2 Short-Read vs. Long-Read Sequencing

Short-Read Sequencing (Illumina)

Long-Read Sequencing

PacBio HiFi

Oxford Nanopore

The future: Long-read sequencing is improving rapidly in accuracy and cost. Within 5-10 years, it may become the standard for comprehensive genomic analysis.

5.2.3 Sequencing Depth and Coverage

Depth and coverage are critical concepts for understanding sequencing quality.

Sequencing depth (often loosely called coverage): the number of reads that overlap a given base, usually reported as an average across the sequenced region (e.g., 30×)
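
Average depth follows from a simple relationship: total sequenced bases divided by the size of the target. The sketch below uses an assumed genome size of 3.1 billion bp and an assumed read length of 150 bp; both numbers, and the function names, are illustrative rather than taken from this chapter.

```python
import math


def mean_depth(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 3_100_000_000  # assumed approximate human genome size
READ_LENGTH_BP = 150            # assumed short-read length

n = reads_needed(30, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"{n:,} reads for 30x")  # 620,000,000 reads for 30x
```

This is an average only. Real depth varies from region to region, which is why uniformity matters as well as the mean.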

Typical depths:

Why does depth matter?

At 10× depth:

At 30× depth:

At 100× depth:
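
One way to make the effect of depth concrete is to ask how often a heterozygous variant would be seen in enough reads to be called. The sketch below uses a simple binomial model with several loud assumptions: each read carries the variant allele with probability 0.5, depth at the site is exactly the stated value, there are no sequencing errors, and a call requires at least 3 supporting reads (a hypothetical threshold, not a standard).

```python
from math import comb


def prob_detect_het(depth: int, min_alt_reads: int = 3, alt_fraction: float = 0.5) -> float:
    """P(at least min_alt_reads reads carry the variant allele) under a binomial model."""
    p_too_few = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1 - p_too_few


for depth in (10, 30, 100):
    print(f"{depth}x: P(detect) = {prob_detect_het(depth):.6f}")
```

Under these assumptions roughly 1 in 20 heterozygous sites is missed at 10×, while misses become very rare at 30× and above. Real data are noisier, so actual performance is worse than this idealized model suggests.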

Coverage uniformity also matters:


5.3 The Noncoding Genome: Regulatory Elements

The 98.5% of the genome that doesn’t code for proteins is not “junk DNA.” It’s a vast regulatory network controlling when, where, and how much genes are expressed.

5.3.1 Types of Regulatory Elements

Promoters

Enhancers

Classic example: Disruptions of regulatory sequence as far as roughly 1 megabase from the SOX9 gene have been linked to campomelic dysplasia (skeletal malformation) and related conditions. The change doesn’t touch the gene itself, but disrupts its expression during development.

Silencers

Insulators

Splice Sites

Example: Many variants in the BRCA1 gene associated with cancer risk are actually splice site variants, not amino acid changes.

5.3.2 Chromatin States and Epigenetic Marks

DNA doesn’t exist naked in the cell—it’s wrapped around histone proteins, forming chromatin. The state of chromatin determines which parts of the genome are accessible.

Histone modifications are chemical tags on histone proteins that mark functional states:

Chromatin accessibility (open vs. closed):

Why this matters for variant interpretation:


5.4 Experimental Data: The Foundation for AI Models

To build AI models that predict variant impacts, we need training data from experiments. Let’s look at the major data types.

5.4.1 Chromatin Immunoprecipitation Sequencing (ChIP-seq)

What it measures: Where specific proteins bind to DNA genome-wide

How it works:

  1. Crosslink proteins to DNA (formaldehyde)
  2. Fragment DNA into small pieces
  3. Use antibodies to pull down specific protein (e.g., a transcription factor)
  4. Sequence the DNA fragments that were bound
  5. Map reads to genome to find binding sites
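
The last step, turning mapped reads into binding sites, can be illustrated with a toy enrichment calculation. This is a deliberately simplified sketch: it assumes a matched control sample (not mentioned in the steps above), equal sequencing depth in both samples, fixed-width bins, and an arbitrary fold-change cutoff. Dedicated peak callers use proper statistical models instead, and all names here are hypothetical.

```python
from collections import Counter


def bin_counts(read_starts: list[int], bin_size: int) -> Counter:
    """Count reads per fixed-width genomic bin."""
    return Counter(pos // bin_size for pos in read_starts)


def enriched_bins(chip_starts, control_starts, bin_size=200, min_fold=4.0, pseudocount=1.0):
    """Return (start, end, fold) for bins where ChIP reads exceed control by min_fold."""
    chip = bin_counts(chip_starts, bin_size)
    control = bin_counts(control_starts, bin_size)
    peaks = []
    for b, n_chip in sorted(chip.items()):
        # The pseudocount avoids division by zero in bins with no control reads
        fold = (n_chip + pseudocount) / (control.get(b, 0) + pseudocount)
        if fold >= min_fold:
            peaks.append((b * bin_size, (b + 1) * bin_size, fold))
    return peaks


chip_reads = [1010, 1050, 1100, 1120, 1150, 1190, 1199, 5030]   # made-up positions
control_reads = [1100, 5010, 5100]
print(enriched_bins(chip_reads, control_reads))  # [(1000, 1200, 4.0)]
```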

Data output:

Common applications:

Example dataset: ENCODE has ChIP-seq data for:

5.4.2 Assays for Chromatin Accessibility

DNase-seq (DNase I hypersensitivity sequencing)

ATAC-seq (Assay for Transposase-Accessible Chromatin)

5.4.3 Chromosome Conformation Capture (3C and derivatives)

DNA forms 3D structures in the nucleus. Distant genomic regions physically contact each other, especially enhancer-promoter pairs.

Hi-C: Genome-wide contacts
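
Contact data of this kind are usually summarized as a symmetric matrix in which entry (i, j) counts the read pairs linking bin i and bin j. The sketch below builds such a matrix for a single small region; the function name, coordinates, and bin size are made up, and real pipelines add normalization steps that are omitted here.

```python
def contact_matrix(read_pairs, region_length: int, bin_size: int) -> list[list[int]]:
    """Build a symmetric bin-by-bin contact count matrix from paired positions."""
    n_bins = -(-region_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


pairs = [(100, 900), (150, 950), (120, 180), (450, 820)]  # hypothetical contacts
for row in contact_matrix(pairs, region_length=1000, bin_size=250):
    print(row)
```

In this toy example the first and last bins share two contacts despite being far apart in sequence, which is the kind of signal expected for a distal enhancer touching a promoter.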

Promoter-Capture Hi-C

Why this matters:

5.4.4 Massively Parallel Reporter Assays (MPRAs)

The functional testing problem: We can measure chromatin states, but do variants actually affect gene expression?

MPRA solution: Test thousands of variants simultaneously

How it works:

  1. Synthesize thousands of DNA sequences (with and without variants)
  2. Each sequence gets a unique DNA barcode
  3. Insert into reporter constructs (sequence → barcode → reporter gene)
  4. Transfect into cells
  5. Measure reporter gene expression by sequencing barcodes
  6. Compare expression between variant and reference sequences
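
Steps 5 and 6 can be sketched numerically. A common way to score activity is the log ratio of RNA barcode counts to DNA (input) barcode counts, averaged over the barcodes attached to one sequence. The DNA normalization, the pseudocount, and all names and counts below are assumptions for illustration, not details given in this chapter.

```python
from math import log2
from statistics import mean


def activity(barcodes: list[tuple[int, int]], pseudocount: float = 1.0) -> float:
    """Mean log2(RNA/DNA) over the barcodes of one tested sequence."""
    return mean([
        log2((rna + pseudocount) / (dna + pseudocount))
        for rna, dna in barcodes
    ])


def variant_effect(ref_barcodes, alt_barcodes) -> float:
    """Difference in activity (log2 scale): negative means the variant lowers expression."""
    return activity(alt_barcodes) - activity(ref_barcodes)


# Made-up (RNA count, DNA count) pairs, two barcodes per sequence
ref = [(79, 19), (159, 39)]
alt = [(39, 19), (79, 39)]
print(variant_effect(ref, alt))  # -1.0, i.e. about half the reference activity
```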

Example results:

Limitations:

But: Provides ground truth labels for training AI models!


5.5 The Variant Interpretation Challenge

Now we can see the full scope of the problem. Let’s bring it together.

5.5.1 The Clinical Genetics Workflow

When a patient gets whole-genome or whole-exome sequencing, here’s what happens:

Step 1: Variant Calling

Step 2: Annotation

Step 3: Filtering

Step 4: Interpretation

Step 5: Reporting

The bottleneck: Steps 3-5 require expert manual review. A clinical geneticist might spend 4-8 hours per case. We’re sequencing faster than we can interpret.
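
The annotation and filtering steps are the part of this workflow that is easiest to automate. The sketch below keeps only rare variants with a potentially damaging predicted consequence. The `Variant` record, the gene labels, the 0.1% frequency cutoff, and the consequence list are all hypothetical choices for illustration; clinical pipelines use richer annotations and lab-specific criteria.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str                # placeholder gene label
    consequence: str         # e.g. "missense", "synonymous", "frameshift"
    allele_frequency: float  # population frequency from an annotation step


def filter_candidates(variants, max_af=0.001,
                      consequences=("missense", "nonsense", "frameshift", "splice_site")):
    """Keep rare variants whose annotated consequence is in the allowed set."""
    return [
        v for v in variants
        if v.allele_frequency <= max_af and v.consequence in consequences
    ]


variants = [
    Variant("GENE_A", "missense", 0.0001),
    Variant("GENE_B", "synonymous", 0.00005),
    Variant("GENE_C", "missense", 0.02),
    Variant("GENE_D", "frameshift", 0.0),
]
print([v.gene for v in filter_candidates(variants)])  # ['GENE_A', 'GENE_D']
```

Filters like this shrink the list dramatically, but what remains still needs the expert review described above.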

5.5.2 Why Noncoding Variants Are Harder

Coding variants have clear rules:

Noncoding variants are murkier:

Example challenge:

5.5.3 The Promise of AI

This is where AI shines. AI models can:

  1. Learn patterns from large-scale experimental data
    • ENCODE, Roadmap Epigenomics: thousands of experiments
    • Learn what makes a functional enhancer
    • Learn chromatin signatures of regulatory elements
  2. Integrate multiple data types
    • Sequence + chromatin state + conservation + 3D structure
    • Capture complex interactions humans can’t easily describe
  3. Make predictions for every possible variant
    • Once trained, predict in silico (no experiments needed)
    • Fast: millions of variants in minutes
    • Consistent: same variant always gets same score
  4. Prioritize variants for experimental validation
    • Rank by predicted impact
    • Focus experimental resources on high-priority variants
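
Point 4 amounts to sorting by predicted impact and taking the top of the list. A minimal sketch, assuming a model has already produced one score per variant and that higher scores mean larger predicted impact (the variant IDs and scores are invented):

```python
def prioritize(scores: dict[str, float], top_k: int) -> list[tuple[str, float]]:
    """Return the top_k variants ranked by predicted impact, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


predicted_impact = {
    "variant_1": 0.12,
    "variant_2": 0.91,
    "variant_3": 0.55,
    "variant_4": 0.78,
}
print(prioritize(predicted_impact, top_k=2))
# [('variant_2', 0.91), ('variant_4', 0.78)]
```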

What we need from AI models:


Summary

Key Takeaways


📖 Key Terms

| Term | Definition |
|------|-----------|
| **Allele frequency** | Proportion of chromosomes (allele copies) in a population that carry a particular variant; rare variants are enriched for functional impacts |
| **ATAC-seq** | Assay for Transposase-Accessible Chromatin; uses Tn5 transposase to map open chromatin regions genome-wide |
| **ChIP-seq** | Chromatin Immunoprecipitation sequencing; identifies genome-wide binding sites of specific proteins (transcription factors, modified histones) |
| **De novo variant** | A genetic variant present in an individual but absent from both parents; arises as a new mutation |
| **Enhancer** | Regulatory DNA sequence that increases gene transcription; can be located far from its target gene and functions in a cell-type-specific manner |
| **Epistasis** | Interaction between different genetic variants; the functional impact of one variant depends on the presence of others |
| **Frameshift** | Insertion or deletion whose length is not a multiple of 3 base pairs; shifts the reading frame and alters all downstream amino acids |
| **Hi-C** | Chromosome conformation capture method that maps all pairwise 3D contacts across the genome; reveals enhancer-promoter interactions |
| **Indel** | Insertion or deletion of DNA sequence, typically 1-50 base pairs in length |
| **Missense variant** | SNV that changes one amino acid to another in a protein sequence |
| **MPRA** | Massively Parallel Reporter Assay; tests the functional impact of thousands of variants simultaneously by measuring reporter gene expression |
| **Nonsense variant** | SNV that creates a premature stop codon, typically resulting in a truncated protein |
| **Promoter** | Regulatory region immediately upstream of a gene that controls transcription initiation; binding site for RNA polymerase and transcription factors |
| **Splice site** | Sequence at an exon-intron boundary that directs splicing; variants can cause exon skipping or intron retention |
| **Structural variant (SV)** | Large-scale genomic alteration >50 bp, including deletions, duplications, inversions, and translocations |
| **Synonymous variant** | SNV that changes the DNA sequence but not the amino acid, due to genetic code redundancy; usually neutral but can affect splicing |
| **Variant of uncertain significance (VUS)** | A genetic variant whose functional impact and clinical relevance are unclear; requires additional evidence |
| **Whole-exome sequencing (WES)** | Sequencing of protein-coding regions (~1.5% of genome); standard for clinical diagnostics |
| **Whole-genome sequencing (WGS)** | Sequencing of the entire genome, including noncoding regions; provides comprehensive variant detection |

Test Your Understanding

  1. Variant frequency and function: You find a missense variant in a cancer-related gene. In one database, it’s present at 0.01% frequency. In another, it’s at 2% frequency. How would these different frequencies affect your interpretation of whether this variant has functional impact? What might explain the discrepancy?

  2. Technology trade-offs: A research team wants to study structural variants associated with autism spectrum disorder. They have budget for either: (a) short-read WGS of 1,000 individuals at 30× coverage, or (b) long-read WGS of 200 individuals at 30× coverage. What would you recommend and why? What types of variants might be missed or gained with each approach?

  3. Noncoding complexity: Imagine a variant in an enhancer 800 kb upstream of a neurodevelopmental gene. The variant is heterozygous (one copy affected). It reduces enhancer activity by 25% specifically in developing neurons. Why would this be difficult to detect experimentally? Why might patients with this variant show variable phenotypes?

  4. Experimental validation priorities: You’ve identified 50 rare variants in regulatory elements near genes associated with congenital heart defects. You can functionally test 5 variants using MPRA. What additional information would help you prioritize which 5 to test? Consider sequence features, chromatin states, conservation, and Hi-C data.

  5. Epistasis: Two variants individually have no functional impact, but when present together in the same individual, they cause a developmental disorder. Why does this make clinical interpretation challenging? How might AI models help (or struggle) with detecting such epistatic interactions?

  6. De novo variants: A patient has a severe developmental disorder, but trio sequencing reveals only 2 de novo coding variants, both missense variants in genes not previously associated with any disorder. Neither variant is in a highly conserved position. How would you approach interpreting whether one of these variants is responsible? What if the causative variant were actually in a noncoding regulatory element?

  7. Cell-type specificity: A variant falls within an enhancer that’s active only in pancreatic beta cells during embryonic development (weeks 8-12). The patient has neonatal diabetes. Why is it essentially impossible to functionally validate this variant in the patient’s cells? How could AI models trained on embryonic pancreas chromatin data help?

  8. Coverage considerations: When doing clinical WGS, some labs use 30× coverage while others use 60×. For a heterozygous rare variant, why might 30× coverage sometimes miss the variant? If budget is limited, what’s better: sequencing more patients at lower coverage or fewer patients at higher coverage?



What’s Next?

Now that you understand the scale and complexity of genetic variation, we’re ready to explore how earlier computational tools attempted to tackle variant interpretation.

In Chapter 6, we’ll explore:

Before moving on, make sure you:

Ready? Let’s see how conservation analysis and traditional tools approach variant interpretation!