ai-for-genomic-science

Chapter 5: Genetic Variation and Genomic Technologies

4,100,000.

That is approximately how many positions in your DNA differ from the person sitting next to you in class right now. Not a vague “a lot,” but a number you could count out one position at a time. If you printed those differences on paper, one variant per line at 50 lines per page, you would fill 82,000 pages. Stack those pages and the pile would be taller than you are.

Somewhere in that stack — perhaps on page 47,231 or page 81,003 — there might be a single line that explains why your family has a history of early-onset heart disease, or why you can taste bitter compounds in broccoli that your roommate cannot, or why a drug that works for most people gives you a side effect that mystifies your physician. One line in 82,000 pages. And that is assuming the answer lies in a single variant. Many traits and diseases arise from combinations: two variants that are harmless alone but disruptive together, or a common variant that only matters in a particular environmental context.

For most of human history, this stack was invisible. We knew that heredity existed, that traits ran in families, that identical twins were more similar than fraternal ones — but the molecular substrate was opaque. The sequencing revolution of the last two decades cracked it open. We can now read every letter of a person’s genome in a matter of days, for a cost that continues to fall. The stack has become legible. The problem, now, is navigation.

This chapter is about the technologies that generate the stack, the types of variants it contains, and the computational strategies that make it navigable. It is the problem that motivates everything that follows in this book — because before AI can help us interpret variation, we need to understand what variation is, how we measure it, and why the sheer scale of it demands something smarter than a manual search.

The Biological Challenge: Scale, Complexity, and the Noncoding Genome

Before we can apply AI to interpret genetic variants, we need to understand what we’re dealing with. The human genome presents several interconnected challenges:

Challenge 1: Sheer Volume

Each human genome contains approximately 4-5 million variants compared to the reference genome (GRCh38). These include:

~3.5 million single nucleotide variants (SNVs)
~500,000 small insertions and deletions (indels, 1-50 bp)
~5,000-10,000 structural variants (>50 bp)

If you examined one variant per minute, working 24/7, it would take over 7 years to review a single genome. We’re now sequencing hundreds of thousands of genomes per year. Manual review is impossible.
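The review-time estimate above is simple arithmetic, and it is worth checking once by hand. The sketch below assumes one minute per variant and nonstop work; the function name is only illustrative.

```python
# Back-of-the-envelope check of the manual-review estimate.
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes, ignoring leap years


def years_to_review(n_variants: int, minutes_per_variant: float = 1.0) -> float:
    """Years of nonstop (24/7) work needed to look at every variant once."""
    return n_variants * minutes_per_variant / MINUTES_PER_YEAR


for n in (4_000_000, 4_100_000, 5_000_000):
    print(f"{n:,} variants -> {years_to_review(n):.1f} years")
```

Across the 4-5 million range the answer stays between roughly 7.6 and 9.5 years for a single genome.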

Challenge 2: Most of the Genome Isn’t Genes

Only about 1.5% of the human genome codes for proteins (exons). Yet variants outside protein-coding regions can have profound functional impacts:

Promoters control when and where genes are expressed
Enhancers can be located millions of base pairs away from their target genes
Silencers repress gene expression
Splice sites determine which parts of genes are included in final transcripts
Long-range regulatory elements coordinate expression of multiple genes

A variant in an enhancer 500 kilobases upstream of a critical developmental gene can be just as impactful as a variant that changes an amino acid in that gene’s protein. But we have far less understanding of how noncoding variants work.

Challenge 3: Context Matters

The same DNA sequence can have completely different functions depending on:

Cell type: An enhancer active in neurons might be inactive in liver cells
Developmental stage: Critical during embryogenesis, irrelevant in adults
Environmental conditions: Some regulatory elements respond to hormones, nutrients, or stress
Genetic background: Variants interact with each other

A variant that’s harmless in most people might have functional impact when combined with other variants. We call these epistatic interactions, and they mean a variant cannot always be judged in isolation.

Challenge 4: Limited Experimental Capacity

How do we figure out what variants actually do? The gold standard is experimental testing:

Introduce the variant into cells using CRISPR
Measure gene expression changes
Assess cellular phenotypes
Test in animal models

But this approach costs $10,000-$50,000 per variant and takes weeks to months. With 4-5 million variants per genome, experimental validation of everything is neither practical nor affordable.

We need computational predictions to prioritize which variants are worth experimental follow-up.

Learning Objectives

By the end of this chapter, you will be able to:

Distinguish between SNVs, indels, and structural variants and explain their relative frequencies
Describe the major genome sequencing technologies and their advantages for variant detection
Explain why noncoding variants are harder to interpret than coding variants
Identify the key functional categories of noncoding regulatory elements
Understand how variant allele frequency relates to functional impact
Explain why experimental validation cannot scale to millions of variants
Describe the concept of variant pathogenicity and why it’s complex for noncoding variants

5.1 Types of Genetic Variation

Human genetic diversity is beautiful and complex. Let’s start with the major categories of variation and what they mean biologically.

5.1.1 Single Nucleotide Variants (SNVs)

The most common type of genetic variation is the single nucleotide variant (SNV)—one DNA base differs from the reference genome.

Example:

Reference genome:  ...ATCG[A]GTCC...
Individual's DNA:  ...ATCG[G]GTCC...
                         ^
                     SNV: A→G

Key facts about SNVs:

Make up ~85% of all variants
~3.5 million SNVs per individual genome
Most are inherited from parents (present in germline)
Some are de novo (new mutations not in parents)
Distribution: ~38,000 in coding exons, most in noncoding regions

Functional impact depends on location:

In protein-coding regions, SNVs can be:

Synonymous: Change the DNA but not the amino acid (due to genetic code redundancy)
- Example: GAA → GAG (both code for glutamate)
- Usually neutral, though can affect splicing or mRNA stability
Missense: Change one amino acid to another
- Example: GAA (glutamate) → AAA (lysine)
- Impact ranges from neutral to severe depending on:
  - Chemical properties (charged → uncharged is more disruptive)
  - Location (active site vs. surface)
  - Conservation (changing a conserved residue is riskier)
Nonsense: Create a premature stop codon
- Example: TGG (tryptophan) → TGA (stop)
- Usually severe—truncated protein often nonfunctional
- Can trigger nonsense-mediated decay (mRNA destroyed)
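The three coding consequences above follow mechanically from the standard genetic code, so they can be sketched in a few lines. This is a minimal illustration, not a full annotation tool: it assumes you already have the reference and altered codons in the gene's reading frame, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base change between two in-frame codons."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be exactly 3 bases long")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("an SNV changes exactly one base of the codon")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # new premature stop codon
    if ref_aa == "*":
        return "stop-loss"  # the original stop codon is removed
    return "missense"


print(classify_coding_snv("GAA", "GAG"))  # glutamate -> glutamate
print(classify_coding_snv("GAA", "AAA"))  # glutamate -> lysine
print(classify_coding_snv("TGG", "TGA"))  # tryptophan -> stop
```

Note what the sketch cannot tell you: whether a missense change is tolerated, or whether a synonymous change disturbs splicing. Those questions need the additional evidence discussed later in the chapter.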

In noncoding regions, SNVs can affect:

Transcription factor binding sites
Splicing signals
RNA secondary structure
Long-range regulatory interactions

5.1.2 Insertions and Deletions (Indels)

Indels are small insertions or deletions, typically 1-50 base pairs.

Example of a 2 bp deletion:

Reference:  ...ATCGAA[GT]CCTA...
Individual: ...ATCGAA[--]CCTA...
                      deletion

Key facts about indels:

~500,000 indels per genome
~15% of all variants
Can be inherited or de novo
Particularly important in repetitive sequences

In protein-coding regions:

Frameshift indels: Not a multiple of 3 bp
- Shift the reading frame
- Usually severe—completely alters downstream protein sequence
- Example: Deletion of 2 bp shifts all subsequent codons
In-frame indels: Multiples of 3 bp
- Add or remove amino acids without shifting frame
- Impact varies (can be neutral or severe)
- Example: 3 bp deletion removes one amino acid
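The frameshift rule is just a divisibility test on the net length change. A minimal sketch, assuming VCF-style alleles in which the reference and alternate strings share a leading anchor base (the function name is illustrative):

```python
def classify_coding_indel(ref: str, alt: str) -> str:
    """Label a coding indel as 'frameshift' or 'in-frame' from its net length change."""
    net_change = abs(len(alt) - len(ref))
    if net_change == 0:
        raise ValueError("alleles have equal length, so this is not an indel")
    return "in-frame" if net_change % 3 == 0 else "frameshift"


print(classify_coding_indel("AGT", "A"))   # 2 bp deletion
print(classify_coding_indel("AGTC", "A"))  # 3 bp deletion
```

The first call reports a frameshift and the second an in-frame deletion, matching the two examples in the list above.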

In noncoding regions:

Can create or destroy regulatory motifs
Affect spacing between regulatory elements
Alter DNA shape and flexibility

5.1.3 Structural Variants (SVs)

Structural variants are large-scale genomic changes, typically >50 bp, often much larger (kilobases to megabases).

Types of SVs:

Deletions: Large segments removed
- Can delete entire genes or regulatory regions
- Example: 22q11.2 deletion syndrome (DiGeorge syndrome)—3 megabase deletion affecting ~90 genes
Duplications: Segments copied
- Gene dosage effects (too much of a protein)
- Example: PMP22 duplication causes Charcot-Marie-Tooth neuropathy
Inversions: Segment flipped orientation
- Can disrupt genes at breakpoints
- Can separate genes from regulatory elements
Translocations: Segments moved between chromosomes
- Classic example: BCR-ABL fusion in chronic myeloid leukemia
Copy number variants (CNVs): Deletions or duplications
- ~5,000-10,000 per genome
- Can affect gene dosage
- Some are common and benign, others cause disorders

Detection challenges:

Require specialized analysis methods
Short-read sequencing can miss some SVs
Long-read sequencing (PacBio, Oxford Nanopore) improves detection
Often underappreciated in clinical diagnostics
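The size boundaries used throughout Section 5.1 (single base for SNVs, 1-50 bp for indels, more than 50 bp for SVs) can be written down as a small classifier. This is a sketch under simplifying assumptions: it only looks at explicit reference and alternate allele strings, and it does not handle the symbolic alleles (such as `<DEL>`) that real SV callers often emit.

```python
def classify_by_size(ref: str, alt: str, max_indel_bp: int = 50) -> str:
    """Assign a variant to SNV / indel / SV using the size thresholds in this chapter."""
    if ref == alt:
        raise ValueError("reference and alternate alleles are identical")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "other"  # equal-length multi-base substitution, not covered here
    return "indel" if size <= max_indel_bp else "SV"


print(classify_by_size("A", "G"))             # single base change
print(classify_by_size("AGT", "A"))           # 2 bp deletion
print(classify_by_size("A", "A" + "T" * 51))  # 51 bp insertion
```

The 50 bp cutoff is a convention rather than a biological boundary, which is why it is exposed as a parameter.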

5.1.4 Variant Allele Frequency and Functional Impact

Not all variants are created equal. One key predictor of functional impact is how common a variant is in the population.

The logic:

Variants with severe functional impact reduce reproductive fitness
Natural selection removes them from the population over generations
Therefore, common variants are usually neutral or have only small effects
Rare variants are enriched for those with functional impact

Allele frequency categories:

Common (>1% frequency):
- Present in millions of people
- Generally neutral or mildly deleterious
- Examples: APOE ε4 allele (~15% frequency, risk factor for Alzheimer’s)
Low frequency (0.1-1%):
- Present in thousands to millions
- Mixed bag—some neutral, some with mild effects
Rare (<0.1%):
- Present in dozens to thousands
- Enriched for functional variants
- Focus of clinical sequencing
Ultra-rare (<0.01%):
- Seen in very few individuals or families
- Highest enrichment for disorder-causing variants
- But also includes recent neutral mutations
De novo (not in parents):
- New mutation in the individual
- ~70 de novo variants per person
- Important in conditions affecting reproductive fitness

Important caveat: This is a statistical tendency, not a rule. Some common variants do have functional impact (e.g., the sickle cell allele is common in malaria-endemic regions because carrying one copy protects against severe malaria). And most rare variants are still neutral; they’re just rare because they arose recently.
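The frequency categories above translate directly into a binning function. In this sketch the handling of exact boundary values is an assumption (the text does not say which bin 1% itself belongs to), and because ultra-rare is a subset of rare, the function returns the most specific label.

```python
def frequency_category(allele_frequency: float) -> str:
    """Bin a population allele frequency (a fraction between 0 and 1)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if allele_frequency > 0.01:      # more than 1%
        return "common"
    if allele_frequency >= 0.001:    # 0.1% to 1%
        return "low frequency"
    if allele_frequency >= 0.0001:   # 0.01% to 0.1%
        return "rare"
    return "ultra-rare"              # below 0.01%


for af in (0.15, 0.005, 0.0005, 0.00005):
    print(f"{af:.5f} -> {frequency_category(af)}")
```

A de novo variant is not a frequency bin at all: it is defined by absence from both parents, so it needs trio data rather than a population database.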

5.2 Genome Sequencing Technologies

Understanding genetic variation requires the ability to accurately read DNA sequences. Let’s explore the major technologies and their trade-offs.

5.2.1 From Microarrays to Whole-Genome Sequencing

DNA Microarrays (SNP chips)

Principle: Hybridization-based detection of known SNPs
Coverage: 500,000 to 5 million pre-selected SNPs
Advantages:
- Cheap ($50-$100 per sample)
- Fast (thousands of samples in parallel)
- Great for population genetics and GWAS
Limitations:
- Only detects pre-selected variants (can’t discover new ones)
- Misses indels, structural variants, rare variants
- Limited value for diagnosing rare disorders, since the causal variant is often not on the chip
Use cases: Ancestry testing (23andMe), GWAS studies, agriculture

Whole-Exome Sequencing (WES)

Principle: Sequence only protein-coding exons (~1.5% of genome)
Coverage: ~20,000 genes, ~180,000 exons
Advantages:
- Cheaper than whole-genome ($300-$500)
- High depth of coverage (typically 80-100× or more in targeted regions)
- Enriched for functional variants in genes
- Standard for clinical diagnostics
Limitations:
- Misses noncoding variants (promoters, enhancers, etc.)
- Variable coverage at exon boundaries
- Misses ~2% of exons due to capture inefficiency
Use cases: Clinical diagnostics, studies of Mendelian disorders

Whole-Genome Sequencing (WGS)

Principle: Sequence the entire genome
Coverage: All 3.2 billion base pairs
Advantages:
- Detects all variant types (SNVs, indels, SVs)
- Covers noncoding regulatory regions
- More uniform coverage than WES
- Can detect repeat expansions and, on long-read platforms, some epigenetic marks such as DNA methylation
Limitations:
- More expensive ($600-$1,000, dropping fast)
- Generates huge amounts of data (~100 GB per genome)
- Harder to interpret (~98.5% is noncoding)
Use cases: Research, comprehensive clinical diagnostics, population genomics

The trend: WGS costs are dropping rapidly. As of 2025, the cost of WGS is approaching that of WES, making WGS increasingly the preferred method.

5.2.2 Short-Read vs. Long-Read Sequencing

Short-Read Sequencing (Illumina)

Read length: 150-300 base pairs
Advantages:
- Extremely accurate (>99.9%)
- High throughput (billions of reads)
- Cost-effective
- Dominant platform (>80% of market)
Limitations:
- Struggles with repetitive regions
- Can’t span some structural variants
- Difficult to phase variants (determine if on same chromosome)
- Hard to assemble de novo genomes
How it works: Sequencing by synthesis—fluorescently labeled nucleotides incorporated one at a time

Long-Read Sequencing

PacBio HiFi

Read length: 10,000-25,000 base pairs (10-25 kb)
Accuracy: ~99.9% (after circular consensus)
Advantages:
- Spans repetitive regions
- Can phase variants directly
- Better SV detection
- Reads through complex genomic regions
Use cases: De novo assembly, SVs, complex regions

Oxford Nanopore

Read length: Up to 2 million base pairs (ultra-long reads!)
Accuracy: ~95-99% (improving rapidly)
Advantages:
- Longest reads available
- Real-time sequencing
- Portable (MinION device fits in your hand)
- Can detect base modifications (methylation) directly
Unique feature: Measures ionic current changes as DNA passes through nanopore
Use cases: Structural variants, repeat expansions, rapid diagnostics

The future: Long-read sequencing is improving rapidly in accuracy and cost. Within 5-10 years, it may become the standard for comprehensive genomic analysis.

5.2.3 Sequencing Depth and Coverage

Depth and coverage are critical concepts for understanding sequencing quality.

Sequencing depth (or coverage): Number of times a given base is sequenced

Notation: “30× coverage” means each base is sequenced ~30 times on average
Higher depth = more confident variant calls
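Average depth is simply the total number of sequenced bases divided by the genome size. The sketch below uses that relationship in both directions; the genome size is the 3.2 billion base pairs quoted earlier, and the function names are illustrative.

```python
import math

GENOME_SIZE_BP = 3_200_000_000  # human genome size used in this chapter


def mean_depth(n_reads: int, read_length_bp: int,
               genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int,
                 genome_size_bp: int = GENOME_SIZE_BP) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


n = reads_needed(30, 150)
print(f"{n:,} reads of 150 bp give about {mean_depth(n, 150):.0f}x average depth")
```

So a 30× genome with 150 bp reads needs on the order of 640 million reads, which is why short-read throughput is counted in billions of reads per run.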

Typical depths:

WES: 80-100× in targeted regions
WGS: 30-40× for clinical, 10-15× for population studies
Single-cell: 0.01-1× (very sparse!)

Why does depth matter?

At 10× depth:

Can reliably detect homozygous variants
Will miss some heterozygous variants (need reads from both alleles)
Difficult to distinguish true variants from sequencing errors

At 30× depth:

Reliable heterozygous variant calling
Standard for clinical WGS
Good balance of cost and accuracy

At 100× depth:

Very high confidence
Can detect low-frequency somatic variants (cancer, mosaicism)
Expensive (roughly 3.3× the data of 30×)
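Why low depth misses heterozygous variants can be seen with a binomial model: at a heterozygous site each read has about a 50% chance of carrying the alternate allele. The sketch below is a simplification. It assumes exactly `depth` reads cover the site, ignores sequencing errors, and uses a hypothetical calling rule of at least 3 supporting reads; real variant callers use richer statistical models.

```python
from math import comb


def p_detect(depth: int, min_alt_reads: int = 3, alt_fraction: float = 0.5) -> float:
    """Probability of seeing at least `min_alt_reads` alternate-allele reads.

    Each read is treated as an independent draw that carries the alternate
    allele with probability `alt_fraction` (0.5 for a heterozygous variant).
    """
    # Sum the binomial probabilities of seeing too few alternate reads.
    p_miss = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_miss


for depth in (10, 30, 100):
    print(f"{depth:>3}x: P(detect heterozygous variant) = {p_detect(depth):.6f}")
```

Under these assumptions about 5% of heterozygous sites are missed at exactly 10 reads, while at 30 reads a miss is vanishingly rare. Lowering `alt_fraction` (say, to 0.05 for a somatic variant present in a minority of cells) shows why mosaic and tumor variants call for much higher depth.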

Coverage uniformity also matters:

Some regions are hard to sequence (GC-rich, repetitive)
“30× average coverage” might mean some regions are only 5×
Exome capture creates variable coverage across exons

5.3 The Noncoding Genome: Regulatory Elements

The 98.5% of the genome that doesn’t code for proteins is not “junk DNA.” It’s a vast regulatory network controlling when, where, and how much genes are expressed.

5.3.1 Types of Regulatory Elements

Promoters

Location: Immediately upstream of gene start site (typically within 1 kb)
Function: Binding site for RNA polymerase and transcription factors
Size: ~100-1,000 bp
Key features:
- Core promoter elements (TATA box, Inr, DPE)
- Transcription factor binding sites
- Often CpG-rich (clusters of CG dinucleotides)
Impact of variants: Can dramatically alter gene expression levels

Enhancers

Location: Can be far from target gene (often 50-500 kb away, sometimes >1 Mb!)
Function: Boost transcription of target genes in specific cell types/conditions
Size: ~50-1,500 bp
Key features:
- Multiple transcription factor binding sites (TFBSs)
- Cell-type specific activity
- Can regulate multiple genes
- DNA loops to contact promoters
Numbers: ~400,000-1,000,000 enhancers in human genome
Impact of variants: Can cause disorders by altering gene expression in specific tissues

Classic example: Disruptions of regulatory regions located up to about 1 megabase from the SOX9 gene have been linked to campomelic dysplasia (skeletal malformation). These changes don’t touch the gene itself, but they disrupt its expression during bone development.

Silencers

Function: Repress gene transcription
Mechanism: Recruit repressive transcription factors
Less understood: Historically harder to identify than enhancers
Impact: Variants can cause inappropriate gene activation

Insulators

Function: Block enhancer-promoter interactions
Mechanism: Create boundaries between regulatory domains
Key protein: CTCF (binds insulator sites)
Impact: Variants can allow enhancers to act on wrong genes

Splice Sites

Location: Exon-intron boundaries
Function: Determine which exons are included in mature mRNA
Critical sequences:
- Donor site (GT at start of intron): ...exon|GT...intron
- Acceptor site (AG at end of intron): ...intron...AG|exon...
- Branch point (A typically ~20-50 bp before acceptor)
Impact: Variants can cause:
- Exon skipping (entire exon missing from mRNA)
- Intron retention (intron not removed)
- Cryptic splice site activation (wrong site used)
- These often lead to frameshift or nonsense-mediated decay

Example: Many variants in the BRCA1 gene associated with cancer risk are actually splice site variants, not amino acid changes.
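The canonical donor and acceptor dinucleotides described above are easy to check in code. This sketch assumes you have the intron sequence on the transcribed strand, from its first base to its last, and it only looks at the GT and AG dinucleotides; it says nothing about the branch point or the weaker flanking signals. The function names and the toy intron are made up for illustration.

```python
def canonical_splice_sites(intron: str) -> dict:
    """Report whether an intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return {
        "donor_GT": intron[:2] == "GT",
        "acceptor_AG": intron[-2:] == "AG",
    }


def disrupts_canonical_site(intron: str, offset: int, alt_base: str) -> bool:
    """True if an SNV at a 0-based intron offset destroys a canonical dinucleotide."""
    mutated = intron[:offset] + alt_base + intron[offset + 1:]
    before = canonical_splice_sites(intron)
    after = canonical_splice_sites(mutated)
    return any(before[site] and not after[site] for site in before)


toy_intron = "GTAAGTCTTTCAG"  # hypothetical, far shorter than a real intron
print(disrupts_canonical_site(toy_intron, 0, "A"))  # hits the donor GT
print(disrupts_canonical_site(toy_intron, 5, "C"))  # deep in the intron
```

A check like this catches only the most obvious splice variants. Cryptic site activation and exon skipping caused by changes outside the dinucleotides need sequence-based predictors.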

5.3.2 Chromatin States and Epigenetic Marks

DNA doesn’t exist naked in the cell—it’s wrapped around histone proteins, forming chromatin. The state of chromatin determines which parts of the genome are accessible.

Histone modifications are chemical tags on histone proteins that mark functional states:

H3K4me3 (histone H3, lysine 4, tri-methylated):
- Mark of active promoters
- Sharp peaks at transcription start sites
- Used to identify active genes
H3K4me1 (histone H3, lysine 4, mono-methylated):
- Mark of enhancers
- Broader regions
- Often co-occurs with H3K27ac at active enhancers
H3K27ac (histone H3, lysine 27, acetylated):
- Mark of active regulatory regions
- Distinguishes active from poised/inactive enhancers
- Strong predictor of functional impact
H3K27me3 (histone H3, lysine 27, tri-methylated):
- Mark of repressed regions
- Often at developmentally silenced genes
- Polycomb-mediated repression
H3K9me3 (histone H3, lysine 9, tri-methylated):
- Mark of heterochromatin
- Permanently silent regions
- Repetitive elements, centromeres

Chromatin accessibility (open vs. closed):

Open chromatin: DNA accessible to transcription factors
- Measured by: DNase-seq, ATAC-seq
- Indicates regulatory regions
Closed chromatin: DNA tightly wrapped, inaccessible
- Inactive regulatory regions
- Silent heterochromatin

Why this matters for variant interpretation:

Variants in open chromatin with active histone marks are more likely to have functional impact
Variants in closed, repressed chromatin are less likely to matter
Cell-type specificity: Same variant might be in open chromatin in neurons but closed in liver cells
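In practice, the first question asked of a noncoding variant is often "does it sit in open chromatin, and in which cell types?" The sketch below answers that with a sorted-interval lookup. The peak coordinates and cell-type names are invented for illustration; real peaks would come from ATAC-seq or DNase-seq peak files.

```python
from bisect import bisect_right


def in_any_peak(position: int, peaks: list) -> bool:
    """Check whether a position falls inside any peak.

    `peaks` is a sorted list of non-overlapping (start, end) intervals on one
    chromosome, half-open: start is included, end is not.
    """
    starts = [start for start, _ in peaks]
    i = bisect_right(starts, position) - 1  # last peak starting at or before position
    return i >= 0 and peaks[i][0] <= position < peaks[i][1]


# Hypothetical open-chromatin peaks on a single chromosome, per cell type.
open_chromatin = {
    "neuron": [(1_000, 1_500), (8_000, 8_400)],
    "liver": [(3_000, 3_600)],
}

variant_position = 8_100
for cell_type, peaks in open_chromatin.items():
    print(cell_type, in_any_peak(variant_position, peaks))
```

Here the same variant is accessible in neurons but not in liver cells, which is exactly the cell-type specificity described above.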

5.4 Experimental Data: The Foundation for AI Models

To build AI models that predict variant impacts, we need training data from experiments. Let’s look at the major data types.

5.4.1 Chromatin Immunoprecipitation Sequencing (ChIP-seq)

What it measures: Where specific proteins bind to DNA genome-wide

How it works:

Crosslink proteins to DNA (formaldehyde)
Fragment DNA into small pieces
Use antibodies to pull down specific protein (e.g., a transcription factor)
Sequence the DNA fragments that were bound
Map reads to genome to find binding sites

Data output:

Peaks (regions with high read counts)
Indicates where protein was bound
Resolution: ~100-300 bp
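To make "peaks are regions with high read counts" concrete, here is a deliberately naive peak caller. It is a toy under stated assumptions, not how production tools work: it bins read start positions, compares ChIP counts to a control sample (the control is an assumption added here, since the steps above do not mention one), and keeps bins above a fold-enrichment cutoff. Dedicated peak callers use proper statistical models and normalize for library size.

```python
from collections import Counter


def call_peaks(chip_positions, control_positions,
               bin_size=200, min_fold=4.0, pseudocount=1.0):
    """Return (start, end, fold) for bins where ChIP reads are enriched over control."""
    chip = Counter(pos // bin_size for pos in chip_positions)
    control = Counter(pos // bin_size for pos in control_positions)
    peaks = []
    for b in sorted(chip):
        # The pseudocount avoids division by zero in bins with no control reads.
        fold = (chip[b] + pseudocount) / (control[b] + pseudocount)
        if fold >= min_fold:
            peaks.append((b * bin_size, (b + 1) * bin_size, fold))
    return peaks


# Made-up read positions: a pile-up near 1,000-1,200 and one stray read at 5,000.
chip_reads = [1000, 1010, 1025, 1040, 1060, 1080, 1100, 1120, 1150, 1190, 5000]
control_reads = [1050, 5000]
print(call_peaks(chip_reads, control_reads))
```

Only the bin with the pile-up passes the cutoff; the isolated read at position 5,000 looks the same in ChIP and control and is ignored.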

Common applications:

Transcription factor ChIP-seq: Find binding sites for specific TFs
Histone modification ChIP-seq: Map H3K4me3, H3K27ac, etc.
RNA Polymerase II ChIP-seq: Find actively transcribed regions

Example dataset: ENCODE has ChIP-seq data for:

~600 transcription factors
Dozens of histone modifications
Across ~100 cell types
Millions of regulatory elements characterized

5.4.2 Assays for Chromatin Accessibility

DNase-seq (DNase I hypersensitivity sequencing)

Principle: DNase I enzyme cuts accessible DNA
Measures: Regulatory elements, TF footprints
Resolution: ~50-200 bp
Advantages: Gold standard, high resolution
Limitations: Requires many cells, technically challenging

ATAC-seq (Assay for Transposase-Accessible Chromatin)

Principle: Tn5 transposase inserts sequencing adapters into accessible DNA
Measures: Open chromatin regions
Resolution: ~50-200 bp
Advantages:
- Easy protocol
- Requires few cells (50,000 cells, or even single cells!)
- Fast
Limitations: More background noise than DNase-seq
Status: Becoming the dominant method

5.4.3 Chromosome Conformation Capture (3C and derivatives)

DNA forms 3D structures in the nucleus. Distant genomic regions physically contact each other, especially enhancer-promoter pairs.

Hi-C: Genome-wide contacts

Measures: All pairwise interactions across the genome
Resolution: 1 kb to 1 Mb (depends on sequencing depth)
Reveals:
- Topologically associating domains (TADs)
- A/B compartments (active vs. inactive)
- Specific enhancer-promoter loops
Data size: Huge (billions of read pairs per experiment)

Promoter-Capture Hi-C

Focus: Interactions involving promoters specifically
Advantages: Enriched for functional regulatory contacts
Use: Linking enhancers to target genes

Why this matters:

An enhancer variant only matters if it regulates an important gene
Hi-C tells us which genes an enhancer likely controls
Critical for interpreting noncoding variants

5.4.4 Massively Parallel Reporter Assays (MPRAs)

The functional testing problem: We can measure chromatin states, but do variants actually affect gene expression?

MPRA solution: Test thousands of variants simultaneously

How it works:

Synthesize thousands of DNA sequences (with and without variants)
Each sequence gets a unique DNA barcode
Insert into reporter constructs (sequence → barcode → reporter gene)
Transfect into cells
Measure reporter gene expression by sequencing barcodes
Compare expression between variant and reference sequences

Example results:

“Variant A increases expression 2-fold” → likely functional
“Variant B has no effect” → likely neutral
Test 10,000+ variants in a single experiment
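A statement like "Variant A increases expression 2-fold" comes from comparing barcode counts. One common way to summarize an MPRA, sketched here under simplifying assumptions, is to divide RNA barcode counts by the DNA (plasmid) counts for each allele and then take the log2 ratio between alleles. The counts are invented, the pseudocount is an arbitrary choice, and real analyses model barcode-level noise and replicates.

```python
import math


def activity(rna_counts, dna_counts, pseudocount=1.0):
    """Reporter activity of one allele: RNA output per unit of DNA input."""
    return (sum(rna_counts) + pseudocount) / (sum(dna_counts) + pseudocount)


def log2_allelic_effect(ref_rna, ref_dna, alt_rna, alt_dna):
    """log2(variant activity / reference activity); 1.0 means a 2-fold increase."""
    return math.log2(activity(alt_rna, alt_dna) / activity(ref_rna, ref_dna))


# Hypothetical counts for two barcodes per allele.
ref_rna, ref_dna = [99, 100], [100, 99]
alt_rna, alt_dna = [200, 199], [100, 99]
effect = log2_allelic_effect(ref_rna, ref_dna, alt_rna, alt_dna)
print(f"log2 fold change = {effect:.2f}")
```

Dividing by DNA counts matters because barcodes are not equally represented in the plasmid library; without it, an abundant barcode would look like a strong enhancer.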

Limitations:

Not in native genomic context
May miss long-range effects
Cell type and condition specific
Expensive and technically challenging

But: Provides ground truth labels for training AI models!

5.5 The Variant Interpretation Challenge

Now we can see the full scope of the problem. Let’s bring it together.

5.5.1 The Clinical Genetics Workflow

When a patient gets whole-genome or whole-exome sequencing, here’s what happens:

Step 1: Variant Calling

Align sequencing reads to reference genome
Identify positions that differ (variants)
Filter out low-quality calls
Output: 4-5 million variants (WGS) or ~50,000 (WES)

Step 2: Annotation

Determine genomic context of each variant
- Coding vs. noncoding
- Gene name
- Consequence type (missense, synonymous, splice site, etc.)
Add population frequencies from databases (gnomAD)
Add known associations (ClinVar, OMIM)

Step 3: Filtering

Remove common variants (>1% frequency) → Usually neutral
Focus on rare variants in genes relevant to phenotype
Prioritize by predicted functional impact
This is where AI comes in!

Step 4: Interpretation

Review remaining candidate variants
Check literature on genes
Evaluate evidence for functional impact
Assess whether variant explains patient phenotype

Step 5: Reporting

Classify variants:
- Disorder-causing (pathogenic)
- Likely disorder-causing (likely pathogenic)
- Uncertain significance (VUS)
- Likely neutral (likely benign)
- Neutral (benign)
Report clinically relevant findings to physician

The bottleneck: Steps 3-5 require expert manual review. A clinical geneticist might spend 4-8 hours per case. We’re sequencing faster than we can interpret.
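Step 3 is the part of this workflow that is easiest to automate, so it is worth seeing in miniature. The sketch below is illustrative only: the `Variant` fields, the gene names, and the `impact_score` (a stand-in for whatever predictor is used) are all hypothetical, and real pipelines operate on annotated VCF files rather than small Python objects.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    consequence: str
    allele_frequency: float  # population frequency, e.g. from gnomAD
    impact_score: float      # hypothetical predicted impact, 0 (none) to 1 (high)


def prioritize(variants, phenotype_genes, max_af=0.01):
    """Drop common variants, keep phenotype-relevant genes, rank by predicted impact."""
    kept = [
        v for v in variants
        if v.allele_frequency <= max_af and v.gene in phenotype_genes
    ]
    return sorted(kept, key=lambda v: v.impact_score, reverse=True)


# Made-up example variants.
candidates = [
    Variant("GENE_A", "missense", 0.15, 0.90),     # common, removed
    Variant("GENE_A", "missense", 0.0002, 0.40),
    Variant("GENE_B", "splice site", 0.00001, 0.95),
    Variant("GENE_C", "nonsense", 0.00001, 0.99),  # gene not linked to phenotype
]
for v in prioritize(candidates, phenotype_genes={"GENE_A", "GENE_B"}):
    print(v.gene, v.consequence, v.impact_score)
```

Notice that the high-scoring variant in GENE_C is discarded purely because of the gene list. Filtering is fast, but every filter encodes an assumption that can hide the real answer, which is part of why Steps 4 and 5 still need expert review.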

5.5.2 Why Noncoding Variants Are Harder

Coding variants have clear rules:

Premature stop codons are usually bad
Frameshifts are usually bad
Changing a conserved amino acid in an active site is probably bad
The genetic code itself (which codon specifies which amino acid) is fully known

Noncoding variants are murkier:

We don’t know where all regulatory elements are
Even when we know location, we don’t fully understand the code
- Enhancers have complex grammars of multiple TF binding sites
- Spacing and orientation matter
- Context-dependent (cell type, developmental stage)
Effects can be subtle (20% change in expression, not complete loss)
Variants can affect tissue-specific expression (hard to detect experimentally)

Example challenge:

A variant in an enhancer might reduce expression of gene X by 30% specifically in developing neurons during weeks 8-12 of embryonic development
No way to test this experimentally in humans
Even patient-derived neurons in a dish won’t perfectly recapitulate developmental timing
We need computational predictions

5.5.3 The Promise of AI

This is where AI shines. AI models can:

Learn patterns from large-scale experimental data
- ENCODE, Roadmap Epigenomics: thousands of experiments
- Learn what makes a functional enhancer
- Learn chromatin signatures of regulatory elements
Integrate multiple data types
- Sequence + chromatin state + conservation + 3D structure
- Capture complex interactions humans can’t easily describe
Make predictions for every possible variant
- Once trained, predict in silico (no experiments needed)
- Fast: millions of variants in minutes
- Consistent: same variant always gets same score
Prioritize variants for experimental validation
- Rank by predicted impact
- Focus experimental resources on high-priority variants

What we need from AI models:

Accurate predictions of functional impact
Calibrated confidence scores (know when predictions are uncertain)
Interpretable results (why did the model make this prediction?)
Generalization across cell types and conditions

Summary

Key Takeaways

Human genomes contain 4-5 million variants compared to reference; most are neutral, but finding functional variants is critical for understanding genetic disorders
Variant types include SNVs (~85%), indels (~15%), and structural variants; each has distinct functional consequences and detection challenges
Sequencing technologies range from cheap microarrays (known variants only) to comprehensive WGS (all variant types); long-read sequencing is improving detection of complex variants
The noncoding genome (98.5%) contains vast regulatory networks including promoters, enhancers, silencers, and insulators; these control gene expression in cell-type and condition-specific ways
Experimental data (ChIP-seq, ATAC-seq, Hi-C, MPRAs) reveal regulatory element locations and functions, providing training data for AI models
Chromatin states (histone modifications, accessibility) indicate functional activity; variants in open, active chromatin are more likely to have functional impact
Clinical variant interpretation requires filtering millions of variants to identify a handful of candidates; manual review is time-consuming and doesn’t scale
Noncoding variants are especially challenging due to incomplete understanding of regulatory codes, context-dependence, and difficulty of experimental validation
AI’s promise is learning patterns from large-scale data to predict variant impacts accurately, consistently, and at scale

📖 Key Terms

| Term | Definition |
|------|-----------|
| **Allele frequency** | Proportion of chromosomes (allele copies) in a population that carry a particular variant; rare variants are enriched for functional impacts |
| **ATAC-seq** | Assay for Transposase-Accessible Chromatin; uses Tn5 transposase to map open chromatin regions genome-wide |
| **ChIP-seq** | Chromatin Immunoprecipitation sequencing; identifies genome-wide binding sites of specific proteins (transcription factors, modified histones) |
| **De novo variant** | A genetic variant present in an individual but absent from both parents; arises as a new mutation |
| **Enhancer** | Regulatory DNA sequence that increases gene transcription; can be located far from its target gene and functions in a cell-type-specific manner |
| **Epistasis** | Interaction between different genetic variants; the functional impact of one variant depends on the presence of others |
| **Frameshift** | Insertion or deletion that's not a multiple of 3 base pairs; shifts the reading frame and alters all downstream amino acids |
| **Hi-C** | Chromosome conformation capture method that maps all pairwise 3D contacts across the genome; reveals enhancer-promoter interactions |
| **Indel** | Insertion or deletion of DNA sequence, typically 1-50 base pairs in length |
| **Missense variant** | SNV that changes one amino acid to another in a protein sequence |
| **MPRA** | Massively Parallel Reporter Assay; tests functional impact of thousands of variants simultaneously by measuring reporter gene expression |
| **Nonsense variant** | SNV that creates a premature stop codon, typically resulting in a truncated protein |
| **Promoter** | Regulatory region immediately upstream of a gene that controls transcription initiation; binding site for RNA polymerase and transcription factors |
| **Splice site** | Sequence at an exon-intron boundary that directs splicing; variants can cause exon skipping or intron retention |
| **Structural variant (SV)** | Large-scale genomic alteration >50 bp, including deletions, duplications, inversions, and translocations |
| **Synonymous variant** | SNV that changes the DNA sequence but not the amino acid due to genetic code redundancy; usually neutral but can affect splicing |
| **Variant of uncertain significance (VUS)** | A genetic variant whose functional impact and clinical relevance are unclear; requires additional evidence |
| **Whole-exome sequencing (WES)** | Sequencing of protein-coding regions (~1.5% of genome); standard for clinical diagnostics |
| **Whole-genome sequencing (WGS)** | Sequencing of the entire genome including noncoding regions; provides comprehensive variant detection |

Test Your Understanding

Variant frequency and function: You find a missense variant in a cancer-related gene. In one database, it’s present at 0.01% frequency. In another, it’s at 2% frequency. How would these different frequencies affect your interpretation of whether this variant has functional impact? What might explain the discrepancy?
Technology trade-offs: A research team wants to study structural variants associated with autism spectrum disorder. They have budget for either: (a) short-read WGS of 1,000 individuals at 30× coverage, or (b) long-read WGS of 200 individuals at 30× coverage. What would you recommend and why? What types of variants might be missed or gained with each approach?
Noncoding complexity: Imagine a variant in an enhancer 800 kb upstream of a neurodevelopmental gene. The variant is heterozygous (one copy affected). It reduces enhancer activity by 25% specifically in developing neurons. Why would this be difficult to detect experimentally? Why might patients with this variant show variable phenotypes?
Experimental validation priorities: You’ve identified 50 rare variants in regulatory elements near genes associated with congenital heart defects. You can functionally test 5 variants using MPRA. What additional information would help you prioritize which 5 to test? Consider sequence features, chromatin states, conservation, and Hi-C data.
Epistasis: Two variants individually have no functional impact, but when present together in the same individual, they cause a developmental disorder. Why does this make clinical interpretation challenging? How might AI models help (or struggle) with detecting such epistatic interactions?
De novo variants: A patient has a severe developmental disorder, but trio sequencing reveals only 2 de novo coding variants, both missense variants in genes not previously associated with any disorder. Neither variant is in a highly conserved position. How would you approach interpreting whether one of these variants is responsible? What if the causative variant were actually in a noncoding regulatory element?
Cell-type specificity: A variant falls within an enhancer that’s active only in pancreatic beta cells during embryonic development (weeks 8-12). The patient has neonatal diabetes. Why is it essentially impossible to functionally validate this variant in the patient’s cells? How could AI models trained on embryonic pancreas chromatin data help?
Coverage considerations: When doing clinical WGS, some labs use 30× coverage while others use 60×. For a heterozygous rare variant, why might 30× coverage sometimes miss the variant? If budget is limited, what’s better: sequencing more patients at lower coverage or fewer patients at higher coverage?

What’s Next?

Now that you understand the scale and complexity of genetic variation, we’re ready to explore how earlier computational tools attempted to tackle variant interpretation.

In Chapter 6, we’ll explore:

Why evolutionary conservation is predictive of functional impact
Traditional tools like SIFT and PolyPhen-2
Conservation scoring methods (GERP, phyloP, phastCons)
How these approaches prioritize variants for clinical investigation
The limitations that motivate machine learning approaches

Before moving on, make sure you:

Understand the differences between SNVs, indels, and SVs
Can explain why rare variants are enriched for functional impacts
Know the major types of noncoding regulatory elements
Appreciate the scale challenge (millions of variants per genome)
Have explored the gnomAD or ENCODE websites

Ready? Let’s see how conservation analysis and traditional tools approach variant interpretation!