RNA structure probing · method explainer

DMS-MaPseq reads in-cell RNA accessibility as sequencing mutations.

The method turns a chemical reaction into a quantitative structure signal. DMS marks adenines and cytosines that are accessible inside the cell; a read-through reverse transcriptase writes those marks as mismatches; and mismatch rates become a per-base view of RNA structure.

After Zubradt, Gupta, Persad, Lambowitz, Weissman & Rouskin, Nature Methods 2017 · built for readers fluent in either deep learning or RNA biology, not necessarily both.

1 The question

A sequence is not a shape

An RNA molecule is a string over four letters — A, C, G, U — but it does not stay a string. It folds back on itself, pairing complementary letters (A·U, G·C) into double-stranded stems, leaving the unpaired letters out in loops. That folded shape is what decides what the RNA does: whether a ribosome can start translating it, whether it gets shipped to one end of a cell, whether a single point mutation flips it into a different conformation.

So the central measurement is deceptively simple to state: for every position in the molecule, is that base paired (buried) or unpaired (exposed)? Answer that across a whole transcript and you can reconstruct the structure. The catch is doing it in vivo — on real RNA, at native concentrations, inside cells — at single-nucleotide resolution.
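
As a minimal sketch of what the target labels look like, assume the structure is already known and written in dot-bracket notation (a common convention, not one this explainer itself uses). The toy hairpin and the name `paired_mask` are illustrative assumptions, not taken from the paper.

```python
def paired_mask(dot_bracket: str) -> list[bool]:
    """Return True for every position that is base-paired in a dot-bracket string."""
    stack = []                       # indices of '(' still waiting for a partner
    paired = [False] * len(dot_bracket)
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            paired[i] = paired[j] = True
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return paired

# Hypothetical toy hairpin: a 4-bp stem closing a 4-nt loop.
sequence  = "GGCCAACAGGCC"
structure = "((((....))))"
labels = paired_mask(structure)
for base, is_paired in zip(sequence, labels):
    print(base, "paired" if is_paired else "unpaired")
```

The experiment never sees `structure`; it only sees a noisy proxy for `labels`, which is what the rest of the method is about.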

If you come from ML

Think of it as recovering a per-token label — exposed vs buried — for a sequence, where the label is never observed directly. You only get an indirect, noisy readout, and your job is to estimate the per-position probability of "exposed" from many independent measurements.

If you come from RNA

This is the same goal as classic footprinting or SHAPE, but the contribution here is the readout chemistry, not the probe. Hold onto the distinction between how a base gets marked (DMS) and how that mark is detected (reverse transcription). The whole method lives in the second step.

2 The probe

DMS stains only the open bases

Dimethyl sulfate (DMS) is a small reagent that diffuses into cells and methylates adenine and cytosine — but only on the chemical face those bases use to pair up (the Watson–Crick edge). If an A or C is locked in a stem, that face is occupied and DMS can't touch it. If it's sitting in a loop, the face is free and DMS tags it. G and U are effectively invisible to this readout, so the probe reports on A's and C's.

That single fact is the whole physical basis of the method: a DMS mark means "this A or C was unpaired at the moment of treatment." Release the reagent below and watch where it sticks.

Interactive figure: DMS reactivity on a hairpin

stem bases are paired · loop bases are exposed · only unpaired A & C get marked

Legend: paired (protected) · unpaired A / C (reactive) · unpaired G / U (not read) · DMS mark

What to notice: reagent that reaches a stem base or a G/U bounces off. It only commits to unpaired A and C. The pattern of marks is a fingerprint of the loop.
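
The marking rule can be sketched as a toy simulation. This is an illustration under stated assumptions, not a chemical model: the hairpin, the per-base reaction probability `p_react`, and the function name `dms_marks` are all hypothetical.

```python
import random

def dms_marks(sequence, paired, p_react, rng):
    """Simulate DMS on one molecule.

    Only an unpaired A or C can be marked, each with probability p_react.
    Paired bases and all G/U are never marked in this toy model.
    """
    return [
        base in "AC" and not is_paired and rng.random() < p_react
        for base, is_paired in zip(sequence, paired)
    ]

# Hypothetical hairpin: 4-bp stem (G-C, G-C, A-U, C-G) closing a 5-nt loop.
sequence  = "GGACAGCUAGUCC"
structure = "((((.....))))"
paired = [ch != "." for ch in structure]

# Positions eligible for a mark at all: unpaired A or C.
reactive = [i for i, (b, p) in enumerate(zip(sequence, paired)) if b in "AC" and not p]
print(reactive)  # [4, 6, 8]

# Three simulated molecules; '*' is a DMS mark.
rng = random.Random(0)
for _ in range(3):
    marks = dms_marks(sequence, paired, p_react=0.3, rng=rng)
    print("".join("*" if m else "." for m in marks))
```

The stem contains A and C too (indices 2, 3, 11, 12), but they never appear among the marks: pairing, not base identity alone, decides reactivity.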

3 The readout

The hard part: seeing a chemical mark in a sequencer

A sequencer reads DNA letters; it cannot see a methyl group on an RNA base. To convert the RNA (with its marks) into readable DNA, you use reverse transcriptase (RT), an enzyme that crawls along the RNA and writes a complementary DNA copy. The question is what RT does when it hits a marked base — and this is exactly where DMS-MaPseq parts ways with everything before it.

If you come from ML

Two encoders of the same event. The old one is a low-capacity, lossy channel: it can emit the location of one mark per molecule and then the read ends. The new one is a high-capacity channel: every mark on the molecule is written inline as a symbol substitution, so one read carries the full joint pattern of marks.

The legacy approach (truncation): RT falls off the RNA when it reaches a mark. The end of the resulting DNA fragment therefore pinpoints where a mark was. But once RT falls off, the rest of the molecule is never copied, so you learn about only the first mark RT meets (the one nearest the 3′ end): one position per molecule.

DMS-MaPseq (mutational profiling): using a read-through enzyme, RT does not stop. At a marked base it incorporates the wrong letter — a mismatch — and keeps going. Now a single DNA copy records every mark on that molecule, as a pattern of mismatches. Run both side by side:

Interactive figure: One molecule, three marks: two ways to read it

RT enters at the 3′ end and copies toward 5′ · watch what survives

Legend: DMS mark on template · cDNA written by RT · mismatch (the recorded signal)

The whole paper in one frame: truncation throws away the molecule after the first mark it meets. Mutational profiling keeps the molecule intact and turns each mark into a typo you can later count. More marks per molecule is what unlocks everything downstream.
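
The two readouts can be compared in a few lines. This is a sketch under simplifying assumptions: positions are indexed 5′ to 3′, RT always stops (truncation) or always reads through (MaP), and the mark positions and function names are hypothetical.

```python
def read_by_truncation(length, marks):
    """RT starts at the 3' end (index length-1) and stops at the first mark it meets."""
    for pos in range(length - 1, -1, -1):  # walk 3' -> 5'
        if pos in marks:
            return {pos}  # RT falls off here; marks closer to the 5' end are never seen
    return set()

def read_by_map(length, marks):
    """A read-through RT writes every mark as a mismatch and keeps copying."""
    recorded = set()
    for pos in range(length - 1, -1, -1):  # walk 3' -> 5'
        if pos in marks:
            recorded.add(pos)
    return recorded

# One hypothetical 20-nt molecule carrying three DMS marks.
length, marks = 20, {4, 11, 17}
print(sorted(read_by_truncation(length, marks)))  # [17]
print(sorted(read_by_map(length, marks)))         # [4, 11, 17]
```

Only the second function preserves which marks occurred together on the same molecule, which is the property later sections rely on.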

4 The calculation

Stack the reads, count the mutations

One read is noisy: a mark is probabilistic, and the enzyme isn't perfect. The signal lives in the population. Align many reads of the same region into a pileup and, for each column, compute the fraction of reads that carry a mismatch there. That single number — mismatches ÷ total reads — is the ratiometric DMS signal: high where bases were open, near zero where they were paired.
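
A minimal sketch of that column-wise ratio, assuming reads have already been aligned without gaps and are given as hypothetical `(start, sequence)` pairs in the DNA alphabet. Real pipelines work from alignment files and handle indels and quality filtering, which this toy version ignores.

```python
def mismatch_rates(reference, reads):
    """Per-position mismatch fraction: mismatches / reads covering that position.

    reads: iterable of (start, sequence) pairs, gap-free alignments to reference.
    Positions with no coverage get None instead of a rate.
    """
    n = len(reference)
    coverage = [0] * n
    mismatches = [0] * n
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            coverage[pos] += 1
            if base != reference[pos]:
                mismatches[pos] += 1
    return [m / c if c else None for m, c in zip(mismatches, coverage)]

# Hypothetical reference and four full-length reads (toy numbers, not real data).
reference = "GGACAGCTAGTCC"
reads = [
    (0, "GGACAGCTAGTCC"),  # no mismatches
    (0, "GGACTGCTAGTCC"),  # mismatch at index 4
    (0, "GGACAGTTAGTCC"),  # mismatch at index 6
    (0, "GGACTGCTGGTCC"),  # mismatches at indices 4 and 8
]
rates = mismatch_rates(reference, reads)
print(rates[4], rates[6], rates[8])  # 0.5 0.25 0.25
```

Real DMS-induced mismatch rates are far lower than these toy values; the point is only the shape of the calculation.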

If you come from RNA

The denominator is the same molecules' total coverage, so the rate is internally normalized — no paired untreated control is required for the structure calculation. The background is low and stochastic, not a reproducible per-base offset you'd subtract.

If you come from ML

Each position's signal is just the maximum-likelihood estimate of a Bernoulli rate. Its variance scales like 1/coverage, which is why depth matters: the authors find reproducibility climbs sharply past ~20× per-base coverage. Drag the slider and resample to feel the variance shrink.

Interactive figure: Pileup → per-position reactivity

each row is one read · pink cells are mismatches · bars below are mismatch ÷ total

Control: coverage slider · low coverage ⇒ jumpy estimate · high ⇒ locks onto the truth

Legend: mismatch · match · measured reactivity · true rate

What to notice: at low coverage the bars wander around the dashed "truth" each time you resample. Push coverage up and they snap to it. The peaks are the unpaired A/C positions — the loop.
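
The same effect can be reproduced numerically. The sketch below treats each read at one position as an independent Bernoulli draw, so the estimate's standard error is sqrt(p(1 - p)/n). The true rate of 0.1 and the coverage values are hypothetical choices for illustration, not figures from the paper.

```python
import math
import random

def standard_error(p, coverage):
    """Standard error of a Bernoulli rate estimated from `coverage` reads."""
    return math.sqrt(p * (1 - p) / coverage)

def estimate_rate(p, coverage, rng):
    """Draw `coverage` reads at one position and return mismatches / total."""
    hits = sum(rng.random() < p for _ in range(coverage))
    return hits / coverage

rng = random.Random(0)
true_rate = 0.1  # hypothetical per-base mismatch probability
for coverage in (8, 20, 200, 2000):
    # Five independent resamples at this depth, to show the spread.
    estimates = [round(estimate_rate(true_rate, coverage, rng), 3) for _ in range(5)]
    print(coverage, round(standard_error(true_rate, coverage), 4), estimates)
```

Quadrupling coverage halves the standard error, so the gains from extra depth are steep at first and then flatten.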

5 The enzyme

Why a thermostable group II intron RT, and not the usual one

Mutational profiling only works if the enzyme reliably writes a mismatch at a mark, not a deletion or insertion. The authors compare two read-through enzymes. The standard option, SuperScript II with manganese (SSII–Mn²⁺), turns nearly a third of its marks into insertions and deletions. TGIRT, the thermostable group II intron RT, keeps almost everything as clean substitutions.

That matters because of where the signal sits. A substitution names one exact position. An indel inside a run of identical bases is ambiguous — you can't say which base it came from — so it smears the single-nucleotide resolution the whole method depends on.

Interactive figure: Mutation type by enzyme & why indels hurt

left: composition of recorded mutations · right: a deletion in a homopolymer is positionally ambiguous

Legend: mismatch (usable, single-nt) · deletion · insertion

What to notice: TGIRT records ~94% of marks as mismatches vs ~71% for SSII–Mn²⁺. On the right, one missing base in A A A A A could have come from any of the five — a mismatch A→G could not. TGIRT also detects the harshest endogenous marks (methyl-A) at far higher rates, which is why the authors use it for everything else.
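
The homopolymer ambiguity is easy to verify directly. In this sketch the template sequence is hypothetical; the code only shows that deleting any one base of a run yields the same read, while substitutions at different positions stay distinguishable.

```python
def delete_at(seq, i):
    """Return seq with the base at index i removed."""
    return seq[:i] + seq[i + 1:]

def substitute_at(seq, i, base):
    """Return seq with the base at index i replaced by `base`."""
    return seq[:i] + base + seq[i + 1:]

template = "GCAAAAAGC"   # hypothetical: a run of five A's at indices 2-6
run = range(2, 7)

deletion_products = {delete_at(template, i) for i in run}
substitution_products = {substitute_at(template, i, "G") for i in run}

print(len(deletion_products))      # 1: all five deletions give the same read
print(len(substitution_products))  # 5: each A->G mismatch names its position
```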

6 The payoff

Many marks per molecule means you can separate mixtures

Here is what truncation could never do. If a region folds into two different shapes in the cell (two subpopulations), a population-average readout returns their blend, which can look like a third structure that no molecule actually adopts. Because DMS-MaPseq records the full mark pattern on each molecule, those molecules can be told apart.

The paper's clean demonstration is a riboSNitch: a single-nucleotide difference between two alleles of human MRPS21 that refolds the local structure. Analyzed together the signal is mush; split by allele, two crisp, mutually exclusive structures appear. Toggle below.

Interactive figure: Two structures hiding in one average

reactivity across one region · the same data, combined vs. resolved

Combined view: averaging blurs both states into an uninformative middle

Legend: open in this state · paired in this state · ★ = base whose pairing flips between alleles

What to notice: in the combined view the starred bases sit at an ambiguous half-height — neither clearly open nor paired. Resolving by allele recovers two sharp, opposite patterns. This is the foothold for single-molecule structure analysis in general.
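
The blur-then-resolve effect can be sketched with toy reads. Everything here is hypothetical: the six-position window, the allele labels, and the exaggerated mismatch rates are chosen to make the arithmetic visible, and every read is assumed to span the whole window.

```python
from collections import defaultdict

def rates(mark_sets, n):
    """Fraction of reads carrying a mismatch at each of n positions."""
    counts = [0] * n
    for marks in mark_sets:
        for pos in marks:
            counts[pos] += 1
    return [c / len(mark_sets) for c in counts]

# Hypothetical reads: (allele seen at the SNP, positions with a mismatch).
reads = [
    ("A", {0, 1}), ("A", {0}), ("A", {1}), ("A", {0, 1}),
    ("C", {4, 5}), ("C", {4}), ("C", {5}), ("C", {4, 5}),
]
n = 6

# Population average: both alleles pooled.
combined = rates([marks for _, marks in reads], n)

# Resolved: group reads by the allele each one carries, then average per group.
by_allele = defaultdict(list)
for allele, marks in reads:
    by_allele[allele].append(marks)
resolved = {allele: rates(mark_sets, n) for allele, mark_sets in by_allele.items()}

print(combined)       # [0.375, 0.375, 0.0, 0.0, 0.375, 0.375]
print(resolved["A"])  # [0.75, 0.75, 0.0, 0.0, 0.0, 0.0]
print(resolved["C"])  # [0.0, 0.0, 0.0, 0.0, 0.75, 0.75]
```

Grouping by allele works only because the SNP and the marks sit on the same read; a truncation readout would not carry both.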

7 Reaching any RNA

Genome-wide is blind to rare transcripts — so target them

A transcriptome-wide library spreads reads across everything, so low-abundance RNAs never reach usable coverage. The authors estimate that even at a billion mapped reads, most human genes still fall short of the ~20× threshold. The fix is to stop sequencing everything and amplify just the target: a gene-specific RT primer, a gene-specific PCR, then fragmentation and sequencing. The chemistry is identical — only the enrichment changes.

Targeted DMS-MaPseq workflow

same in-cell marking · selective amplification brings any transcript into reach

Optional refinement: a random barcode (UMI) on the RT primer tags each original molecule, so PCR duplicates can be collapsed — useful for quantitative or low-input work.
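
One simple way to collapse duplicates is a per-UMI majority vote, sketched below. This is an assumed policy for illustration, not the authors' pipeline: it requires exact UMI matches and equal-length reads, and the UMIs and sequences are hypothetical.

```python
from collections import Counter, defaultdict

def collapse_by_umi(reads):
    """Group reads by UMI and return one majority-vote consensus per UMI.

    reads: iterable of (umi, sequence); reads sharing a UMI must have equal length.
    Ties are broken in favor of the base seen first.
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    consensus = {}
    for umi, seqs in groups.items():
        # zip(*seqs) yields one column of bases per aligned position.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

# Hypothetical reads: three PCR copies of one molecule, one copy of another.
reads = [
    ("AACG", "GGACT"),
    ("AACG", "GGACT"),
    ("AACG", "GGACA"),  # likely PCR or sequencing error at the last base
    ("TTGC", "GGACA"),
]
molecules = collapse_by_umi(reads)
print(molecules)  # {'AACG': 'GGACT', 'TTGC': 'GGACA'}
```

After collapsing, each entry counts as one original molecule, so an over-amplified molecule no longer inflates the mismatch rate.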

8 What it made possible

Four things you couldn't do before

The method is the result, but the paper grounds it in four applications that each exploit a different new capability: animal tissue, function, isoforms, and heterogeneity. Two further notes cover data quality and where the approach leads.

Animal tissue

oskar in fly ovaries

First RNA-structure probing inside an animal tissue. Targeted DMS-MaPseq on dissected Drosophila ovaries recovers the known oskar localization structure.

Function

FXR2 starts at a GUG

An extremely GC-rich human 5′ region folds into stems flanking a non-canonical GUG start. Disrupting the structure drops protein; compensatory mutations that restore the fold restore output — so the structure itself is the regulator.

Isoforms

pre- vs. mature mRNA

Intron- and exon-specific primers probe each splice isoform separately. The shared exon folds nearly identically before and after splicing — local structure refolds fast.

Heterogeneity

riboSNitch alleles

The MRPS21 A/C alleles adopt two distinct local structures, invisible in the average and separable here — the case from section 6.

Quality

low, stochastic noise

Negative-control paired bases pile up at near-zero signal, and replicates agree well — the clean signal-to-noise that lets a single ratio stand in for structure.

Where it leads

co-occurrence of marks

Because each read carries multiple marks, the joint statistics of which marks appear together become learnable — the raw material for clustering structural states and for models that predict pairing from sequence.
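
A first step toward those joint statistics is a pairwise co-occurrence count, sketched below. The reads are hypothetical sets of mismatch positions, and this is only the raw counting, not a clustering method from the paper.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(reads):
    """Count how often each pair of positions is mismatched on the same read."""
    pair_counts = Counter()
    for marks in reads:
        for i, j in combinations(sorted(marks), 2):
            pair_counts[(i, j)] += 1
    return pair_counts

# Hypothetical reads, each given as the set of positions carrying a mismatch.
reads = [{0, 1}, {0, 1}, {4, 5}, {4, 5}, {0, 1}, {4}]
pairs = co_occurrence(reads)
print(pairs[(0, 1)])  # 3
print(pairs[(4, 5)])  # 2
print(pairs[(0, 4)])  # 0: these two marks never share a molecule
```

Pairs that co-occur often, and pairs that never do, are the kind of evidence a clustering step could use to assign reads to structural states.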

9 Honest limits

What it does not tell you

DMS reports on A and C only, so G/U pairing is inferred indirectly. The reactivity signal is a per-base accessibility measurement, not a structure — you still need a folding algorithm, with the data as constraints, to propose base pairs, and that step carries its own assumptions (the FXR2 model shifts with the reactivity threshold chosen). And while the data can in principle separate single-molecule states, the paper demonstrates that mainly through known sequence differences (alleles); fully unsupervised clustering of structural subpopulations is framed as the road ahead, not a solved result.

The method in nine claims

RNA shape, not sequence, sets function — so you need per-base paired/unpaired calls in vivo.
DMS methylates only unpaired A and C; a mark means "this base was open."
The mark must be read by an enzyme; the old way (truncation) records one mark per molecule and discards the rest.
Mutational profiling writes each mark as a mismatch and reads through — every mark survives.
Per position, mismatches ÷ total reads is the structure signal; no untreated control needed.
Coverage controls variance; reproducibility climbs past ~20×.
TGIRT keeps marks as clean substitutions (not indels), preserving single-nucleotide resolution.
Multiple marks per molecule let you separate mixtures, e.g. riboSNitch alleles.
Targeted RT–PCR brings rare transcripts, tissues, and isoforms into reach.