Rouskin Lab
Albatross · dependency mapping
RNA structure from sequence alone / an explainer

How an RNA language model exposes structure from sequence.

Albatross starts from a restrained question: if a model trained only to recover missing nucleotides is perturbed at one position, which other positions change their predictions? In structured IRES RNAs, those dependencies often trace the molecule's hidden folded structure. Here is how the signal appears — and where it quietly breaks.

01  /  the thing we cannot see

A code we can read but not fold

Picornaviruses (polio, the common cold, hepatitis A) can't start translation the normal, cap-dependent way. Instead, a long folded stretch of RNA at the front of their genome, an IRES (internal ribosome entry site), grabs the ribosome by its shape. Shape is everything; the same letters folded differently behave differently.

The authors measured how strongly 96 of these IRESes drive translation, across six human and animal cell types. The newest class (Type V) drives roughly twice the output of EMCV, the element everyone uses in gene therapy, and most IRESes are tissue-specific. Rich behavior — which demands a structural explanation. The problem: we have experimentally solved structures for only a handful.

02  /  the incumbent and its blind spot

Covariation needs a crowd

The gold standard for finding pairs without a microscope is covariation: line up many related sequences and look for pairs of positions that mutate together to stay complementary. It works beautifully — when you have a good alignment of many close relatives.
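To make "mutate together to stay complementary" concrete, here is a toy sketch. It is not CaCoFold or any published covariation test, which rely on proper statistics; the function name `covarying_pairs` and the three-sequence alignment are invented for illustration.

```python
# Toy illustration of covariation, not a statistical test.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covarying_pairs(alignment):
    """Column pairs where both columns vary across the alignment,
    yet every sequence still shows a pairable combination."""
    length = len(alignment[0])
    hits = []
    for i in range(length):
        for j in range(i + 1, length):
            col_i = {seq[i] for seq in alignment}
            col_j = {seq[j] for seq in alignment}
            observed = {(seq[i], seq[j]) for seq in alignment}
            both_vary = len(col_i) > 1 and len(col_j) > 1
            if both_vary and observed <= CANONICAL:
                hits.append((i, j))
    return hits

# Hypothetical alignment: a 2-pair stem around a GAAA loop.
relatives = ["GCGAAAGC", "ACGAAAGU", "GGGAAACC"]
crowd_hits = covarying_pairs(relatives)       # [(0, 7), (1, 6)]
lonely_hits = covarying_pairs(relatives[:1])  # [] : one sequence, no variation
```

With a single sequence no column varies, so the function has nothing to report, which is the "needs a crowd" problem in miniature.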

Picornaviral IRESes are notoriously divergent. For the strangest ones you might find only a handful of relatives, and no clean alignment at all. Covariation simply has nothing to work with. That is the gap.

What if you could read the pairs out of a single sequence — no alignment, no relatives?
03  /  the core gesture

What a dependency map is

Take a model trained on one job only: given an RNA sequence with a position hidden, guess the missing letter. To do that well across millions of sequences, it has to internalize how positions constrain each other. Base-paired positions constrain each other the most.

So we run a tiny experiment, position by position. Mutate one nucleotide. Re-run the model. Ask: whose prediction changed? The position that reacts most is, overwhelmingly, the base-pairing partner. Try it — click a nucleotide below.

Interactive figure: Perturb & watch (schematic hairpin). A 14-nt hairpin: a 5-pair stem closed by a small loop. Clicking a base mutates it; the bars show the absolute change in the model's prediction at every other position.
Schematic, built to mirror the published behavior. In the paper the same procedure runs on real IRESes with the 650M-parameter model: for a length-N sequence it performs 3N+1 forward passes (the original plus every single-point substitution) and records the largest shift in log-odds.

Do this for every position and you get an N×N grid — the dependency map. Cell (i, j) is bright when mutating i swings the prediction at j. The diagonal is a position reacting to itself, so it's masked out. Everything interesting lives off the diagonal.
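The perturb-and-compare loop can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_predict_probs` is a made-up stand-in for the language model (it simply pretends position i pairs with position N-1-i), and the exact way the paper aggregates log-odds shifts may differ from the `max` over nucleotides used here.

```python
import math

ALPHABET = "ACGU"
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def toy_predict_probs(seq):
    """Hypothetical stand-in for a masked language model: favours the
    complement of the mirror-image position. A real model returns
    learned per-position probabilities instead."""
    n = len(seq)
    probs = []
    for i in range(n):
        favoured = COMPLEMENT[seq[n - 1 - i]]
        probs.append({b: (0.85 if b == favoured else 0.05) for b in ALPHABET})
    return probs

def log_odds(p):
    return math.log(p / (1.0 - p))

def dependency_map(seq, predict_probs):
    """Cell [i][j] holds the largest absolute log-odds shift at j
    over the three possible substitutions at i."""
    n = len(seq)
    reference = predict_probs(seq)  # 1 pass on the original sequence
    passes = 1
    grid = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for sub in ALPHABET:
            if sub == seq[i]:
                continue
            mutated = seq[:i] + sub + seq[i + 1:]
            shifted = predict_probs(mutated)  # 3 passes per position
            passes += 1
            for j in range(n):
                if j == i:
                    continue  # the diagonal stays masked at zero
                delta = max(
                    abs(log_odds(shifted[j][b]) - log_odds(reference[j][b]))
                    for b in ALPHABET
                )
                grid[i][j] = max(grid[i][j], delta)
    return grid, passes

grid, passes = dependency_map("GGCAUGCC", toy_predict_probs)
# passes is 3 * 8 + 1 = 25, and row i peaks at column 7 - i.
```

Because the toy model hard-codes mirror-image pairing, the resulting map is a single clean antidiagonal; a real model's map is noisier and has to be filtered.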

04  /  the duality

A stem is an antidiagonal

Here's the visual key to everything that follows. In a stem, the first base pairs with the last, the second with the second-to-last, and so on. As one index climbs, its partner's index descends. Plotted on the map, that traces a short line running across the diagonal — an antidiagonal. Each antidiagonal stripe is one stem. Watch a stripe fold into a stem:

Animation: Antidiagonal → arc. Left panel: dependency map with one boxed stripe. Right panel: the same stem, drawn as a molecule.
The bright antidiagonal cells (top-left ↔ bottom-right of the boxed region) are exactly the pairs that close into the stem on the right. Parallel-to-diagonal would mean adjacent positions — which cannot pair. Perpendicular is the signal.
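The index arithmetic behind the stripe is short enough to check by hand: within one stem, i + j is the same for every pair. A small sketch, using 0-based positions and the 14-nt, 5-pair hairpin from the earlier schematic (the helper names are illustrative):

```python
def stem_pairs(start, end, length):
    """Pairs of a stem whose outermost pair is (start, end)."""
    return [(start + k, end - k) for k in range(length)]

def is_antidiagonal(pairs):
    """True when every pair shares one value of i + j,
    i.e. the cells sit on a single antidiagonal of the map."""
    return len({i + j for i, j in pairs}) == 1

hairpin = stem_pairs(0, 13, 5)
# [(0, 13), (1, 12), (2, 11), (3, 10), (4, 9)], every pair sums to 13

neighbours = [(0, 1), (1, 2), (2, 3)]
# adjacent positions run parallel to the diagonal: i + j keeps changing
```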
05  /  nobody supplied the answer

Structure it was never given

This is the part worth dwelling on. The base model, RiNALMo, already knew general RNA — but run a dependency map on an IRES and you get noise. Fine-tune the very same architecture on ~50,000 IRES sequences (still just letters, still just masked-token guessing) and the antidiagonals snap into focus. The model that emerges is Albatross.

Nothing about pairing, geometry, or thermodynamics was ever added. Only more of the right sequences. Drag the slider to fine-tune.

Interactive figure: From noise to stems (EMCV-style map). A slider steps through fine-tuning stages, starting at the base model, where micro F1 against solved structures reads 0.05.
Mirrors Figure 3B–C and the training curves in 3G. Signal appears fast, then a subtlety the slider hints at: training too long, or on too many near-duplicate sequences, makes it worse. A smaller, diversity-balanced 50k set beat a 500k set — more data is not the lever; coverage is.
06  /  from a heatmap to a structure

Commit, but only when sure

Turning the map into an actual list of pairs uses a filter with no biology in it at all — only geometry: ignore weak signal below a threshold, drop the un-pairable band next to the diagonal, keep only stripes long enough to be a real stem, then match each position to at most one partner. No base-pairing rules are enforced, which lets it propose non-canonical pairs.
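The four geometric steps can be sketched as below. This is an illustration, not the paper's code: the parameter names `min_sep` and `min_stem`, their default values, the symmetrisation of the map, and the strongest-stripe-first tie-break are all assumptions of this sketch.

```python
def extract_pairs(grid, tau=0.11, min_sep=4, min_stem=3):
    """Geometry-only filter: threshold, drop the near-diagonal band,
    keep long antidiagonal stripes, allow one partner per position."""
    n = len(grid)

    def score(i, j):
        # Assumption: treat i->j and j->i as a single score.
        return max(grid[i][j], grid[j][i])

    # Steps 1 and 2: threshold, and skip cells closer than min_sep.
    keep = {
        (i, j)
        for i in range(n)
        for j in range(i + min_sep, n)
        if score(i, j) >= tau
    }

    # Step 3: walk each antidiagonal run from its outermost cell.
    stripes = []
    for i, j in sorted(keep):
        if (i - 1, j + 1) in keep:
            continue  # not the outer end of a run
        run = []
        a, b = i, j
        while (a, b) in keep:
            run.append((a, b))
            a, b = a + 1, b - 1
        if len(run) >= min_stem:
            stripes.append(run)

    # Step 4: strongest stripes first; skip stripes that reuse a position.
    stripes.sort(key=lambda run: -sum(score(i, j) for i, j in run))
    used, pairs = set(), []
    for run in stripes:
        if all(i not in used and j not in used for i, j in run):
            for i, j in run:
                used.update((i, j))
                pairs.append((i, j))
    return sorted(pairs)

# Hypothetical 14 x 14 map: one real stem plus three distractors.
demo = [[0.0] * 14 for _ in range(14)]
for k in range(5):
    demo[k][13 - k] = 1.0  # the stem
demo[2][8] = 0.9           # strong but isolated cell: stripe too short
demo[5][6] = 1.0           # inside the near-diagonal band
demo[1][7] = 0.05          # below the threshold
called = extract_pairs(demo)
```

Note that nothing in the function checks whether the two letters can pair, which is what leaves room for non-canonical calls.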

The threshold is the dial that matters. Raise it and the model only commits to pairs it's sure of: precision soars, recall falls. That trade-off is the method's signature, and something to stay honest about. Move the slider:

Interactive figure: Precision ↔ recall. Predicted pairs are drawn solid; real pairs the model passed on are drawn dashed. Legend: predicted and correct; predicted, wrong; real pair, missed. At the slider position shown, precision ("when it says a pair, it's right") reads 0.72 and recall ("of all real pairs, fraction found") reads 0.41.
Operating point from the paper: τ = 0.11 gives median precision ~0.78 with recall ~0.41. High precision means the predictions you get are trustworthy scaffolds; low recall means it is deliberately not guessing at everything.
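For reference, the two numbers are computed from sets of pairs as sketched below. The pair lists are made up for illustration, and returning 0.0 when there are no predictions is a convention chosen here, not something the source specifies.

```python
def precision_recall_f1(predicted, reference):
    """Compare predicted base pairs against a solved structure."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical example: 5 real pairs, 3 predictions, 2 of them correct.
solved = [(0, 13), (1, 12), (2, 11), (3, 10), (4, 9)]
guess = [(0, 13), (1, 12), (2, 10)]
p, r, f = precision_recall_f1(guess, solved)
# p = 2/3, r = 2/5, f = 0.5
```

A stricter threshold shrinks the predicted set: fewer wrong calls push precision up, while every dropped true pair pulls recall down.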
07  /  head to head

Where covariation falls off a cliff

Now the comparison that matters. Bin every IRES by how many similar relatives exist to align. When relatives are plentiful, covariation (here, CaCoFold) keeps up. When they vanish — the regime that actually stumps biologists — covariation collapses and Albatross holds. Pick a bin:

Interactive figure: Precision by number of close relatives, one point per structure, for Albatross and CaCoFold (covariation).
In the 1–9 bin the gap is largest. The most extreme case — a Gallivirus IRES with just four related sequences — defeats covariation entirely, yet Albatross predicts its structure correctly. How?
08  /  the mechanism, not magic

It isn't covariation — it's memory of motifs

With four relatives there's no covariation to exploit, so Albatross must be doing something else. It has learned that certain little sequence words imply certain folds. A classic one is the GNRA tetraloop: see that word, expect a stem closing around it.

To prove it, break the word. Mutate the motif and the prediction degrades; delete it and the stem vanishes entirely. Try each:

Interactive figure: Break the motif, lose the stem (Gallivirus GNRA). Panels: dependency map of the motif region; signal-to-noise of the stem.
Even a conservative two-letter change injects visible noise; scrambling to CCCC floods it; deleting the motif removes the stem outright. The map "knows" the stem because it recognizes the word — a capability covariation has no access to.
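For readers unfamiliar with the notation, GNRA is an IUPAC pattern: G, then any nucleotide (N), then a purine (R, meaning A or G), then A. The sketch below is plain pattern matching on invented toy sequences, not what the model does internally and not the Gallivirus sequence; it only shows that the scrambled and deleted variants no longer contain the word.

```python
import re

# G, any base, purine (A or G), A. The lookahead allows overlapping hits.
GNRA = re.compile(r"(?=(G[ACGU][AG]A))")

def find_gnra(seq):
    """Return (start, motif) for every GNRA occurrence in seq."""
    return [(m.start(), m.group(1)) for m in GNRA.finditer(seq)]

intact = find_gnra("GGCCGAAAGGCC")     # [(4, 'GAAA')]
scrambled = find_gnra("GGCCCCCCGGCC")  # [] : loop replaced by CCCC
deleted = find_gnra("GGCCGGCC")        # [] : loop removed
```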
09  /  the trajectory

Bigger model, broader rules

One more result points the way forward. Fine-tune three sizes (33M, 150M, 650M parameters) on Type I IRESes only, then test on every type. Larger models find more structure, as expected. The striking part: they also get better on types they never trained on. The rules they learn generalize. Switch sizes:

Interactive figure: Predicted base pairs by IRES type, for models trained on Type I only. Bars: Type I (trained); Types II, IV, V (held out).
Mirrors Figure 5I. Held-out types are structurally divergent from Type I, so the gains can't be covariation leaking through — the model is learning transferable structural logic that scales with capacity.
10  /  staying honest

What it still can't do

The map reads dependencies out. The obvious sequel is to write sequences that produce a chosen dependency pattern — inverse folding, supervised by the map itself.