FAIRC
Thread // ARCHITECTURE

Token Deep Memory.

A dynamic memory architecture that mixes full-fidelity tokens with compressed gist embeddings to achieve massive context at sub-quadratic cost.

2 POSTS

The Quadratic Wall

Standard self-attention scales quadratically in context length because every token attends to every other token. At 128k tokens, the attention matrix holds 16 billion entries. At 1M tokens — a plausible requirement for long-horizon agents — it holds 1 trillion. The memory and compute cost is prohibitive even on modern hardware.

Linear attention and sparse attention are partial solutions. Linear attention approximates the softmax kernel, trading exact computation for efficiency but degrading on tasks that require precise token retrieval. Sparse attention restricts the attention pattern to local windows or learned patterns, but must decide ahead of time which tokens are worth attending to — a decision that can only be made correctly with the full context in view.

Memory as the Real Problem

The core issue is not attention computation — it is the assumption that all past tokens deserve equal representation fidelity. Human working memory does not store every sensory event at full resolution; it stores compressed summaries and retains high-fidelity traces only for events that were subsequently reinforced or retrieved.

A model that mimics this would maintain most historical context in a compressed state, with the option to reconstruct high-fidelity fragments on demand. The quadratic blowup disappears because the dense attention matrix is replaced by a much smaller matrix over compressed summaries.

Gist Embeddings and Retrieval

In our architecture, every k consecutive historical tokens are compressed into a single gist vector by a learned pooling operation. The gist preserves mean semantic content while discarding positional and high-frequency token-level variation. The compression schedule is adaptive: regions of high gradient signal during training are compressed more slowly; redundant stretches collapse aggressively.

Formally, for a block of token hidden states h₁, …, hₖ, the gist is computed as:

= Pool(h₁, …, hₖ) = W · mean(h₁, …, hₖ) + b

where W and b are learned. The pooling can be replaced with a small attention mechanism over the block for higher fidelity at modest cost.

Retrieval on Demand

Attention over gists runs in O(n/k) time — linear in the number of compressed blocks rather than the number of raw tokens. When the model's attention pattern signals that a compressed region is load-bearing for the current prediction, a retrieval head reconstructs an approximation of the original token sequence from the gist embedding.

This two-tier system — compressed storage, on-demand decompression — mirrors episodic memory architectures observed in biological neural systems, where the hippocampus stores compressed event representations and reconstructs detail when a retrieval cue matches the stored pattern.

Open Questions

The current design treats compression granularity k as a fixed hyperparameter. An obvious extension is a learned, dynamic k that varies based on local information density. We are also investigating whether gist vectors can serve as addressable keys in an external memory bank, enabling context windows that extend across multiple inference sessions.