RNA structure probing · method explainer
The method turns a chemical reaction into a quantitative structure signal. DMS marks adenines and cytosines that are accessible inside the cell; a read-through reverse transcriptase writes those marks as mismatches; and mismatch rates become a per-base view of RNA structure.
1 The question
An RNA molecule is a string over four letters — A, C, G, U — but it does not stay a string. It folds back on itself, pairing complementary letters (A·U, G·C) into double-stranded stems, leaving the unpaired letters out in loops. That folded shape is what decides what the RNA does: whether a ribosome can start translating it, whether it gets shipped to one end of a cell, whether a single point mutation flips it into a different conformation.
So the central measurement is deceptively simple to state: for every position in the molecule, is that base paired (buried) or unpaired (exposed)? Answer that across a whole transcript and you have the constraints needed to model the structure. The catch is doing it in vivo (on real RNA, at native concentrations, inside cells) at single-nucleotide resolution.
Think of it as recovering a per-token label — exposed vs buried — for a sequence, where the label is never observed directly. You only get an indirect, noisy readout, and your job is to estimate the per-position probability of "exposed" from many independent measurements.
This is the same goal as classic footprinting or SHAPE, but the contribution here is the readout chemistry, not the probe. Hold onto the distinction between how a base gets marked (DMS) and how that mark is detected (reverse transcription). The whole method lives in the second step.
2 The probe
Dimethyl sulfate (DMS) is a small reagent that diffuses into cells and methylates adenine and cytosine, but only on the chemical face those bases use to pair up (the Watson–Crick edge). If an A or C is locked in a stem, that face is occupied and DMS can't touch it. If it's sitting in a loop, the face is free and DMS tags it. G and U are effectively invisible to this readout, so the probe reports only on A and C.
That single fact is the whole physical basis of the method: a DMS mark means "this A or C was unpaired at the moment of treatment." Release the reagent below and watch where it sticks.
DMS reactivity on a hairpin
stem bases are paired · loop bases are exposed · only unpaired A & C get marked
What to notice: reagent that reaches a stem base or a G/U bounces off. It only commits to unpaired A and C. The pattern of marks is a fingerprint of the loop.
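The marking rule can be written down in a few lines. The sketch below is a toy model, not the authors' code: the hairpin sequence, its dot-bracket structure, and the function name `dms_accessible` are all made up for illustration, and real DMS reactivity is probabilistic rather than all-or-nothing.

```python
def dms_accessible(seq: str, dot_bracket: str) -> list[int]:
    """Return 0-based positions that DMS could mark: unpaired ('.') A or C."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must be the same length")
    return [
        i
        for i, (base, state) in enumerate(zip(seq.upper(), dot_bracket))
        if state == "." and base in "AC"
    ]


# Hypothetical toy hairpin: a 4-bp stem closing a 6-nt loop.
seq = "GCAUACGAUCAUGC"
structure = "((((......))))"

print(dms_accessible(seq, structure))  # [4, 5, 7, 9]
```

The stem contains A and C too (positions 1, 2, 10, 13), but they are paired, so they never appear in the output. The G and U in the loop are unpaired but still absent, because this readout does not report on them.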
3 The readout
A sequencer reads DNA letters; it cannot see a methyl group on an RNA base. To convert the RNA (with its marks) into readable DNA, you use reverse transcriptase (RT), an enzyme that crawls along the RNA and writes a complementary DNA copy. The question is what RT does when it hits a marked base — and this is exactly where DMS-MaPseq parts ways with everything before it.
Two encoders of the same event. The old one is a low-capacity, lossy channel: it emits the location of at most one mark per molecule, and then the read ends. The new one is a high-capacity channel: every mark on the molecule is written inline as a symbol substitution, so one read carries the full joint pattern of marks.
The legacy approach (truncation): RT falls off the RNA when it reaches a mark. The end of the resulting DNA fragment therefore pinpoints where a mark was. But once RT falls off, the rest of the molecule is gone, so you learn about only the first mark RT meets (the one nearest the 3′ end): one mark per molecule.
DMS-MaPseq (mutational profiling): using a read-through enzyme, RT does not stop. At a marked base it incorporates the wrong letter — a mismatch — and keeps going. Now a single DNA copy records every mark on that molecule, as a pattern of mismatches. Run both side by side:
One molecule, three marks — two ways to read it
RT enters at the 3′ end and copies toward 5′ · watch what survives
The whole paper in one frame: truncation throws away the molecule after the first mark it meets. Mutational profiling keeps the molecule intact and turns each mark into a typo you can later count. More marks per molecule is what unlocks everything downstream.
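The difference between the two readouts fits in one loop. This is a minimal sketch under simplifying assumptions: positions are 0-based along the RNA written 5′ to 3′, the enzyme is treated as deterministic, and `read_molecule` and the mark positions are hypothetical names and values chosen for the example.

```python
def read_molecule(length: int, marks: set[int], read_through: bool) -> list[int]:
    """Walk the RNA from its 3' end (index length-1) toward its 5' end (index 0).

    Returns the sorted positions of the marks this read records.
    read_through=False models truncation: RT stops at the first mark it meets.
    read_through=True models mutational profiling: every mark becomes a mismatch.
    """
    detected = []
    for pos in range(length - 1, -1, -1):
        if pos in marks:
            detected.append(pos)
            if not read_through:
                break  # RT falls off; everything 5' of this mark is lost
    return sorted(detected)


marks = {5, 12, 20}  # three marks on one hypothetical 30-nt molecule

print(read_molecule(30, marks, read_through=False))  # [20]
print(read_molecule(30, marks, read_through=True))   # [5, 12, 20]
```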
4 The calculation
One read is noisy: a mark is probabilistic, and the enzyme isn't perfect. The signal lives in the population. Align many reads of the same region into a pileup and, for each column, compute the fraction of reads that carry a mismatch there. That single number — mismatches ÷ total reads — is the ratiometric DMS signal: high where bases were open, near zero where they were paired.
The denominator is the total coverage of those same molecules, so the rate is internally normalized: no paired untreated control is required for the structure calculation. The background is low and stochastic, not a reproducible per-base offset you would need to subtract.
Each position's signal is just the maximum-likelihood estimate of a Bernoulli rate. For a true rate p and coverage n its variance is p(1 − p)/n, so the noise shrinks as 1/coverage, which is why depth matters: the authors find reproducibility climbs sharply past ~20× per-base coverage. Drag the slider and resample to feel the variance shrink.
Pileup → per-position reactivity
each row is one read · pink cells are mismatches · bars below are mismatch ÷ total
What to notice: at low coverage the bars wander around the dashed "truth" each time you resample. Push coverage up and they snap to it. The peaks are the unpaired A/C positions — the loop.
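The same resampling experiment can be run offline. The sketch below treats each read at one position as an independent Bernoulli draw and compares the observed spread of the estimate with the textbook standard error sqrt(p(1 − p)/n). The true rate of 0.10, the coverage values, and the trial count are arbitrary illustration values, not numbers from the paper.

```python
import math
import random
import statistics


def standard_error(p: float, coverage: int) -> float:
    """Standard error of a Bernoulli rate estimated from `coverage` reads."""
    return math.sqrt(p * (1 - p) / coverage)


def simulate_rates(p: float, coverage: int, trials: int, rng: random.Random) -> list[float]:
    """Estimate the mismatch rate `trials` times, each from `coverage` simulated reads."""
    return [
        sum(rng.random() < p for _ in range(coverage)) / coverage
        for _ in range(trials)
    ]


rng = random.Random(0)
true_rate = 0.10  # hypothetical mismatch rate at one unpaired position

for coverage in (20, 200, 2000):
    estimates = simulate_rates(true_rate, coverage, trials=200, rng=rng)
    observed = statistics.pstdev(estimates)
    predicted = standard_error(true_rate, coverage)
    print(f"coverage {coverage:>4}: observed spread {observed:.4f}, predicted {predicted:.4f}")
```

Each 100-fold increase in coverage cuts the predicted spread by a factor of 10, which is the 1/coverage variance scaling expressed in standard-deviation terms.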
5 The enzyme
Mutational profiling only works if the enzyme reliably writes a mismatch at a mark — not a deletion or insertion. The authors compare two read-through enzymes. The standard option, Superscript II with manganese (SSII–Mn²⁺), turns nearly a third of its marks into insertions and deletions. The TGIRT enzyme keeps almost everything as clean substitutions.
That matters because of where the signal sits. A substitution names one exact position. An indel inside a run of identical bases is ambiguous — you can't say which base it came from — so it smears the single-nucleotide resolution the whole method depends on.
Mutation type by enzyme & why indels hurt
left: composition of recorded mutations · right: a deletion in a homopolymer is positionally ambiguous
What to notice: TGIRT records ~94% of marks as mismatches vs ~71% for SSII–Mn²⁺. On the right, one missing base in A A A A A could have come from any of the five — a mismatch A→G could not. TGIRT also detects the harshest endogenous marks (methyl-A) at far higher rates, which is why the authors use it for everything else.
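The positional ambiguity is easy to verify by brute force. In this sketch (function names and the toy sequence are invented for the example) we ask which reference positions could explain an observed read: every position whose deletion reproduces the read, or every position where the read differs by substitution.

```python
def delete_at(seq: str, i: int) -> str:
    """Return seq with the base at index i removed."""
    return seq[:i] + seq[i + 1:]


def candidate_deletion_sites(reference: str, observed: str) -> list[int]:
    """All reference positions whose single-base deletion yields `observed`."""
    return [i for i in range(len(reference)) if delete_at(reference, i) == observed]


def candidate_substitution_sites(reference: str, observed: str) -> list[int]:
    """All positions where an equal-length read differs from the reference."""
    if len(reference) != len(observed):
        raise ValueError("substitution comparison needs equal-length sequences")
    return [i for i, (r, o) in enumerate(zip(reference, observed)) if r != o]


reference = "GCAAAAAGC"  # a run of five A's at positions 2..6

read_with_deletion = delete_at(reference, 4)  # one A missing
read_with_mismatch = "GCAAGAAGC"              # A at position 4 read as G

print(candidate_deletion_sites(reference, read_with_deletion))      # [2, 3, 4, 5, 6]
print(candidate_substitution_sites(reference, read_with_mismatch))  # [4]
```

The deletion is compatible with five source positions, while the substitution names exactly one.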
6 The payoff
Here is what truncation could never do. If a region folds into two different shapes in the cell (two subpopulations), a population-average readout returns their blend, which often looks like a third structure that no molecule actually adopts. Because DMS-MaPseq records the full mark pattern on each molecule, those molecules can be told apart.
The paper's clean demonstration is a ribosnitch: a single-nucleotide difference between two alleles of human MRPS21 that refolds the local structure. Analyzed together the signal is mush; split by allele, two crisp, mutually exclusive structures appear. Toggle below.
Two structures hiding in one average
reactivity across one region · the same data, combined vs. resolved
What to notice: in the combined view the starred bases sit at an ambiguous half-height — neither clearly open nor paired. Resolving by allele recovers two sharp, opposite patterns. This is the foothold for single-molecule structure analysis in general.
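A toy calculation shows why the average misleads. The reads below are fabricated for illustration (they are not data from the paper): each read is a dictionary holding the base it carries at the variant site and the set of positions where it shows a mismatch, and every read is assumed to span the whole region.

```python
from collections import defaultdict


def rate_profile(reads: list[dict], length: int) -> list[float]:
    """Fraction of reads with a mismatch at each position (reads span the region)."""
    if not reads:
        raise ValueError("need at least one read")
    counts = [0] * length
    for read in reads:
        for pos in read["mismatches"]:
            counts[pos] += 1
    return [c / len(reads) for c in counts]


def split_by_allele(reads: list[dict]) -> dict[str, list[dict]]:
    """Group reads by the base they carry at the variant site."""
    groups = defaultdict(list)
    for read in reads:
        groups[read["snp_base"]].append(read)
    return dict(groups)


# Hypothetical reads: allele A exposes position 2, allele C exposes position 5.
reads = (
    [{"snp_base": "A", "mismatches": {2}}] * 4
    + [{"snp_base": "A", "mismatches": set()}] * 6
    + [{"snp_base": "C", "mismatches": {5}}] * 4
    + [{"snp_base": "C", "mismatches": set()}] * 6
)

combined = rate_profile(reads, length=8)
per_allele = {a: rate_profile(rs, length=8) for a, rs in split_by_allele(reads).items()}

print("combined:", combined[2], combined[5])
print("allele A:", per_allele["A"][2], per_allele["A"][5])
print("allele C:", per_allele["C"][2], per_allele["C"][5])
```

In the combined profile both positions sit at the same intermediate rate; split by allele, each position is reactive in one group and silent in the other. Grouping here relies on a known sequence difference, which is the easy case.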
7 Reaching any RNA
A transcriptome-wide library spreads reads across everything, so low-abundance RNAs never reach usable coverage. The authors estimate that even at a billion mapped reads, most human genes still fall short of the ~20× threshold. The fix is to stop sequencing everything and amplify just the target: a gene-specific RT primer, a gene-specific PCR, then fragmentation and sequencing. The chemistry is identical — only the enrichment changes.
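The coverage shortfall is simple arithmetic. The sketch below assumes reads land uniformly along a transcript, and every number in it (the transcript's share of the library, the read length, the transcript length) is a hypothetical illustration value, not an estimate from the paper.

```python
def expected_coverage(
    total_reads: float,
    transcript_fraction: float,
    read_length: int,
    transcript_length: int,
) -> float:
    """Mean per-base coverage of one transcript, assuming uniform read placement."""
    reads_on_transcript = total_reads * transcript_fraction
    return reads_on_transcript * read_length / transcript_length


def reads_needed(
    target_coverage: float,
    transcript_fraction: float,
    read_length: int,
    transcript_length: int,
) -> float:
    """Total library reads required to reach the target mean coverage."""
    return target_coverage * transcript_length / (transcript_fraction * read_length)


# Hypothetical rare transcript: 1 in 10 million reads, 2,000 nt long, 100-nt reads.
fraction, read_len, tx_len = 1e-7, 100, 2000

print(expected_coverage(1e9, fraction, read_len, tx_len))  # roughly 5x at a billion reads
print(reads_needed(20, fraction, read_len, tx_len))        # roughly 4 billion reads for 20x
```

Targeted amplification changes `transcript_fraction`, not the chemistry: if nearly all reads come from the amplicon, even a modest run clears the threshold.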
Targeted DMS-MaPseq workflow
same in-cell marking · selective amplification brings any transcript into reach
Optional refinement: a random barcode (UMI) on the RT primer tags each original molecule, so PCR duplicates can be collapsed — useful for quantitative or low-input work.
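Collapsing duplicates can be as simple as keeping one read per barcode. This is a deliberately naive sketch: it assumes each read still carries its UMI, that UMIs are error-free, and that the first read seen is representative. Production tools typically tolerate UMI sequencing errors and build a consensus instead. The function name and the barcodes are made up.

```python
def collapse_by_umi(reads: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first read seen for each UMI; reads are (umi, sequence) pairs."""
    first_seen: dict[str, str] = {}
    for umi, seq in reads:
        first_seen.setdefault(umi, seq)  # later duplicates of a UMI are ignored
    return list(first_seen.items())


reads = [
    ("AACG", "GCATACGA"),
    ("AACG", "GCATACGA"),  # PCR duplicate of the first molecule
    ("TTGC", "GCATGCGA"),
    ("AACG", "GCATACGA"),  # another duplicate
    ("GGTA", "GCATACGA"),  # same sequence, different molecule
]

unique = collapse_by_umi(reads)
print(len(reads), "reads ->", len(unique), "original molecules")
```

The last read has the same sequence as the first but a different barcode, so it counts as an independent molecule. Without UMIs the two would be indistinguishable from PCR copies.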
8 What it made possible
The method is the result, but the paper grounds it in applications that each exploit a different new capability: animal tissue, function, isoforms, and heterogeneity. Two further notes cover data quality and where the approach leads.
Animal tissue
oskar in fly ovaries
First RNA-structure probing inside an animal tissue. Targeted DMS-MaPseq on dissected Drosophila ovaries recovers the known oskar localization structure.
Function
FXR2 starts at a GUG
An extremely GC-rich human 5′ region folds into stems flanking a non-canonical GUG start. Disrupting the structure drops protein; compensatory mutations that restore the fold restore output — so the structure itself is the regulator.
Isoforms
pre- vs. mature mRNA
Intron- and exon-specific primers probe each splice isoform separately. The shared exon folds nearly identically before and after splicing — local structure refolds fast.
Heterogeneity
ribosnitch alleles
The MRPS21 A/C alleles adopt two distinct local structures, invisible in the average and separable here — the case from section 6.
Quality
low, stochastic noise
Negative-control paired bases pile up at near-zero signal, and replicates agree well — the clean signal-to-noise that lets a single ratio stand in for structure.
Where it leads
co-occurrence of marks
Because each read carries multiple marks, the joint statistics of which marks appear together become learnable — the raw material for clustering structural states and for models that predict pairing from sequence.
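Counting which marks travel together is the first step toward those joint statistics. The sketch below tallies, for each pair of positions, how many reads carry a mismatch at both. The reads are invented for illustration, and this is only the counting step, not a clustering method.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(reads: list[set[int]]) -> Counter:
    """Count reads in which both positions of each pair (i < j) carry a mismatch."""
    pair_counts: Counter = Counter()
    for marks in reads:
        for i, j in combinations(sorted(marks), 2):
            pair_counts[(i, j)] += 1
    return pair_counts


# Hypothetical reads, each given as the set of mismatch positions it carries.
reads = [{3, 7}, {3, 7}, {3, 7, 12}, {12}, {12, 15}, {15}]

pairs = co_occurrence(reads)
print(pairs.most_common(1))  # [((3, 7), 3)]
```

Positions 3 and 7 appear together in three reads while 12 and 15 share only one. A truncation readout, with at most one mark per read, could never populate this table.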
9 Honest limits
DMS reports on A and C only, so G/U pairing is inferred indirectly. The reactivity signal is a per-base accessibility measurement, not a structure — you still need a folding algorithm, with the data as constraints, to propose base pairs, and that step carries its own assumptions (the FXR2 model shifts with the reactivity threshold chosen). And while the data can in principle separate single-molecule states, the paper demonstrates that mainly through known sequence differences (alleles); fully unsupervised clustering of structural subpopulations is framed as the road ahead, not a solved result.