Vitoria Lima

notes 03/30: from paper to products

somewhere in PR, March 30 2026 (I wrote this during my coding vacation. I really like to surf first thing in the morning and then go all in on my many rabbit holes. If you like longboard surfing, pls reach out with any reccs!)

I think that cool papers like these, S4 (arxiv.org/abs/2111.00396, see explanation below) and RLMs (arxiv.org/abs/2512.24601v1, see my prev blogpost, notes-rlms), are what create awesome products down the line. That is the question I always end up at when I read a paper that excites me. Not "is this theoretically interesting," but what kind of products can come out of this?


What are state space models?

State space models (SSMs) come from control theory, the branch of engineering and applied mathematics that figures out how to steer rockets, stabilize airplanes, and keep a thermostat from overshooting. The core idea is beautifully simple: you have a hidden state that evolves over time according to some rules, and you can only partially observe it through measurements. The state captures everything about the system; inputs push it, outputs observe it. Originally used for rockets and filters, SSMs have now been adapted for sequence modeling in deep learning.

state space equations
x′(t) = Ax(t) + Bu(t)
y(t) = Cx(t) + Du(t)

The matrix A governs how the state evolves on its own (the system's internal dynamics). B controls how inputs push the state around. C maps the hidden state to what you actually observe, and D is a direct input-to-output shortcut (often zero).

S1 - state evolution under A

The animation above shows a 2D state vector evolving over time under a rotation matrix A. Each input nudges the state, and the trail shows the system's memory of its trajectory. This is a toy example (real SSMs operate in much higher dimensions), but the principle is the same: state in, state out, remember what matters.
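The same idea fits in a few lines of numpy. This is my own toy sketch (the matrices are illustrative choices, not taken from the animation): a 2D state under a slightly damped rotation A, nudged once by an input.

```python
import numpy as np

# A toy version of the animation: 2D state under a slightly damped
# rotation A. All matrices here are my own illustrative choices.
theta = 0.1
A = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
B = np.array([[1.0], [0.0]])   # inputs nudge the first state dimension
C = np.array([[1.0, 0.0]])     # we observe only the first dimension

x = np.zeros((2, 1))
trail = []                     # the system's "memory" of its trajectory
for k in range(50):
    u = 1.0 if k == 0 else 0.0  # one impulse at t=0, then nothing
    x = A @ x + B * u           # state in, state out
    trail.append((C @ x).item())

# The observed output oscillates and decays: the state remembers
# the nudge, and the damping slowly forgets it.
```

Run it and `trail` traces the spiral the animation shows: the impulse echoes through the state long after the input is gone.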


What is S4, and why it matters

The S4 paper (Structured State Spaces for Sequence Modeling, arxiv.org/abs/2111.00396) took the classic state space equations above and asked: what if we made this a deep learning layer?

There are three key ideas that make S4 work. Let me walk through them.

1. HiPPO: remember everything

The first insight is in the A matrix. The authors used a specific initialization called HiPPO (High-order Polynomial Projection Operator), a principled initialization that lets the state maintain a compressed memory of the entire input history by projecting it onto a basis of orthogonal polynomials. The problem it solves: how do you maintain a fixed-size memory of a growing history without forgetting?

The answer: project the input history onto a basis of Legendre polynomials, a family of orthogonal polynomials that form a complete basis (any function can be approximated by a weighted sum of them, and the more terms you keep, the better the approximation). Think of it like this: instead of storing every raw input, you store N coefficients that together can reconstruct the entire history. As new inputs arrive, you update these coefficients with a matrix multiply. The HiPPO matrix A is not learned. It is derived mathematically so that the coefficients always optimally approximate the full history.

The result: unlike RNNs where gradients decay exponentially (you forget old inputs), HiPPO gives polynomial decay. Information from 1 million timesteps ago is still recoverable. Not a sliding window, not a fixed buffer.

A mathematical projection of the entire history.
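For the curious, the HiPPO-LegS matrix can be written down directly. This is my reading of the published formula, so treat the exact signs and normalization as a sketch rather than gospel:

```python
import numpy as np

# The HiPPO-LegS matrix, written from the published formula as I read it:
#   A[n, k] = sqrt(2n+1) * sqrt(2k+1)  if n > k
#           = n + 1                    if n == k
#           = 0                        if n < k
# and the SSM uses -A so the dynamics stay stable.
def hippo_legs(N: int) -> np.ndarray:
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(n + 1):
            A[n, k] = n + 1 if n == k else np.sqrt((2 * n + 1) * (2 * k + 1))
    return -A

A = hippo_legs(8)   # lower triangular: coefficient n only hears 0..n
```

Note the matrix is lower triangular: updating coefficient n only involves coefficients 0 through n, which is part of what makes the structure tractable.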

2. Discretization: from continuous to computable

The state space equations above are continuous (they use derivatives). Computers work in discrete steps. So S4 discretizes the system using a step size Δ, converting the continuous-time differential equations into discrete-time difference equations that can be computed step by step. S4 uses the bilinear method, which preserves stability and frequency response:

discretized SSM
x_k = Ā x_{k-1} + B̄ u_k
y_k = C x_k
where Ā and B̄ are derived from A, B, and the step size Δ

Now it reads like a simple loop: take the previous state, multiply by Ā, add the new input scaled by B̄, read the output through C. This is the recurrence mode, and it processes one token at a time in O(1).
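That loop, with the bilinear discretization, is only a few lines of numpy. A minimal sketch with toy A, B, C matrices of my own choosing (not S4's actual parameterization):

```python
import numpy as np

def discretize(A, B, dt):
    # Bilinear (Tustin) method:
    #   Abar = inv(I - dt/2 * A) @ (I + dt/2 * A)
    #   Bbar = inv(I - dt/2 * A) @ (dt * B)
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    return inv @ (I + dt / 2 * A), inv @ (dt * B)

def run_recurrence(Abar, Bbar, C, inputs):
    # previous state times Abar, plus new input through Bbar, read via C
    x = np.zeros((Abar.shape[0], 1))
    ys = []
    for u in inputs:
        x = Abar @ x + Bbar * u
        ys.append((C @ x).item())
    return ys

A = np.array([[-1.0,  0.5],
              [-0.5, -1.0]])        # stable continuous dynamics (Re λ < 0)
B = np.array([[1.0], [0.0]])
C = np.array([[0.0, 1.0]])
Abar, Bbar = discretize(A, B, dt=0.1)
ys = run_recurrence(Abar, Bbar, C, inputs=[1.0] + [0.0] * 9)
```

One nice property of the bilinear method: a stable continuous system (eigenvalues with negative real part) maps to a stable discrete one (eigenvalues inside the unit circle), so the recurrence never blows up.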

3. The convolution trick: why it trains fast

Here is where it gets clever. If you unroll that recurrence for a full sequence, something beautiful happens:

unrolling the recurrence
y_0 = CB u_0
y_1 = CAB u_0 + CB u_1
y_2 = CA²B u_0 + CAB u_1 + CB u_2
each output is a weighted sum of all previous inputs

See the pattern? Each output y_k is a weighted sum of all inputs up to that point. The weights are CAⁿB for different values of n. This is exactly the definition of a convolution: an operation that computes a weighted sum over a sequence using a fixed set of weights (a kernel). In signal processing, it blends nearby values together; in deep learning, it is the core operation behind CNNs. The key property: it can be computed efficiently using the Fast Fourier Transform (FFT).

The kernel (the set of weights) is:

the convolution kernel
K = [CB,   CAB,   CA²B,   ...   CA^(N-1)B]
precompute this once, then apply via FFT to the entire sequence

And here is the punchline: convolutions can be computed with the Fast Fourier Transform (FFT) in O(N log N), instead of the O(N²) that naive convolution or Transformer attention would require.
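You can check the equivalence of the two modes numerically. A sketch with random toy matrices (scaled small so the recurrence stays tame; this is not the paper's parameterization):

```python
import numpy as np

# Recurrence mode and convolution mode compute the same outputs.
rng = np.random.default_rng(0)
N, L = 4, 16                                  # state size, sequence length
A = 0.25 * rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(L)

# Recurrence mode: one step per token
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())

# Convolution mode: precompute the kernel K = [CB, CAB, CA²B, ...],
# then one FFT-based (zero-padded) convolution over the whole sequence
K = np.zeros(L)
An = np.eye(N)
for n in range(L):
    K[n] = (C @ An @ B).item()                # C A^n B
    An = A @ An
Y = np.fft.rfft(u, n=2 * L) * np.fft.rfft(K, n=2 * L)
y_conv = np.fft.irfft(Y, n=2 * L)[:L]

print(np.allclose(y_rec, y_conv))  # prints True: same outputs, two paths
```

The zero-padding to 2L is what turns the FFT's circular convolution into the linear convolution the recurrence implements.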

So S4 has two modes. The bars below show relative compute cost for processing a sequence of N tokens (hover over each for details):

the dual nature of S4
Transformer
O(N²)
Self-attention. Every token attends to every other token, producing an N×N attention matrix. For a 4K sequence: 16 million operations. For 128K: 16 billion. Scales quadratically.
S4 (train)
O(N log N)
Convolution mode. Precompute the kernel K once, then convolve with the input via FFT. For 128K tokens: ~2.2 million operations instead of 16 billion. Processes the entire sequence in parallel.
S4 (inference)
O(1) per token
Recurrence mode. x_k = Ā x_{k-1} + B̄ u_k. One matrix-vector multiply per token, regardless of sequence length. The state vector is fixed-size. No attention matrix, no growing memory.
train: precompute kernel K, convolve via FFT (parallel, fast)
inference: run the recurrence step by step (O(1) per token, streaming)

The elegance is that it is the same model. During training, you use convolution mode to process entire sequences in parallel. During inference, you switch to recurrence mode and process tokens one at a time. Same weights, two computational paths.

The result: a model that can do what a Transformer can do, but with dramatically fewer trainable parameters and much better scaling on long sequences. I went deep on it (portfolio/ai-project-3) back in 2022 and saw an 86% reduction in trainable parameters on the tasks I tested.


S4 to Cartesia

I got so obsessed with the S4 paper when it came out back in February 2022. A brilliant concept borrowed from math: state spaces, but make it deep learning.

And what came out of it a few years later (cartesia.ai/blog/seed)? Cartesia (cartesia.ai/sonic)! Since an S4 model can do what a Transformer can do with fewer trainable parameters, all of a sudden you can process heavier data. Audio is heavier than language. So of course this is perfect for a voice product API. The math enabled the market.


RLMs to ...

And now, what can RLMs unleash for products?

The obvious answer is anything that chokes on context today: legal research across million-page archives, codebase-wide reasoning, multi-document synthesis. But the less obvious and more exciting answer is that recursive self-delegation might be the primitive behind the first AI products that feel like they genuinely remember, not because they store every token, but because they learned to go back and look.

This is one of the many million-dollar questions, and the kind of question that is my roman empire (Gen Z slang for something you think about constantly and obsessively, from a 2023 TikTok trend where people asked their partners how often they think about the Roman Empire; the answer was always) and my everyday curse.

Find out more about RLMs from my other notes here (notes-rlms).

If you are curious about this kind of conversation, where research papers become products and the math unlocks the market, I would love to get you a coffee and a croissant.

vitoria@vitorialima.com
Did you know?
The Kalman filter, arguably the most famous state space model ever built, was used on the Apollo missions (nasa.gov/specials/apollo50th) to navigate to the Moon. It estimated the spacecraft's position and velocity from noisy sensor readings, fusing radar, accelerometer, and star-tracker data into a single best guess of where the spacecraft actually was.
Rudolf Kalman (en.wikipedia.org/wiki/Rudolf_E._Kalman) published his paper in 1960. By 1969, his algorithm was running on the Apollo Guidance Computer, a machine with 74KB of memory. State space models went from a math paper to a Moon landing in nine years.
Modern SSMs like S4 and Mamba are the deep learning descendants of that same idea: hidden state, evolve, observe, repeat.