A plain vector store answers one question: which stored text is most semantically similar to my query? That is enough for a chatbot recalling a fact, but it is not how a believable character remembers. A character should let vivid-but-old details fade, should surface the things that mattered even when they are not on-topic, and should keep a just-mentioned event warm. Park et al. ("Generative Agents: Interactive Simulacra of Human Behavior", UIST 2023, arXiv:2304.03442) built a memory architecture that does exactly this — a flat memory stream scored on three signals at retrieval time, plus a reflection step that periodically synthesizes higher-level insights and writes them back as new memories. This lesson is the runnable companion to agent memory, and it lives in this repo's agents-lab/ package as agents_lab/memory_stream.py.
Mental Model
Retrieval is more than similarity: the question is not "what is most relevant?" but "what is worth surfacing right now?", and that depends on recency and importance too. A plain vector store ranks by relevance alone. The memory stream keeps a flat, ever-growing list of observations and, at retrieval time, scores each one on three independent signals, normalizes them onto a common scale, and returns the top k by weighted sum. Two structural details give the architecture its character. First, retrieval refreshes recency: reading a memory marks it as just-accessed, so frequently recalled memories stay warm and rarely touched ones decay. Second, the stream is not static: when enough important things have happened, the agent reflects, distilling recent memories into higher-level statements that are themselves stored back as memories. Those reflections then compete for retrieval like everything else, so the agent can recall conclusions, not just raw events.
This is the difference between a memory architecture and an embedding index. The Reflexion lab writes verbal lessons after failures; the memory stream generalizes that idea — reflection here is unsupervised, fired by an importance budget rather than an evaluator verdict.
The Three Retrieval Signals
Every memory is a record carrying its text, an importance score, a created_at and a last_accessed timestamp, and a precomputed embedding vec. At retrieval, the stream computes three arrays over all records and combines them.
Recency models forgetting. It decays exponentially in the time since the memory was last accessed — not since it was created. A memory you keep recalling never ages; one you ignore fades.
recency = DECAY_PER_HOUR ** ((now - r.last_accessed) / 3600.0) # DECAY_PER_HOUR = 0.99
With a per-hour decay of 0.99, a memory untouched for a day sits at 0.99 ** 24 ≈ 0.79; after a week (168 hours), 0.99 ** 168 ≈ 0.18. The decay base is the knob that sets how quickly the agent's attention moves on.
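The arithmetic above is easy to check directly. The sketch below is a standalone illustration, not the lab's code: the `recency` and `half_life_hours` helpers are hypothetical names, and only the 0.99 decay base comes from the lesson.

```python
import math

DECAY_PER_HOUR = 0.99  # same base as the lesson; treat it as a tunable assumption


def recency(now: float, last_accessed: float, decay: float = DECAY_PER_HOUR) -> float:
    """Exponential decay in the hours since last access (timestamps in seconds)."""
    hours = (now - last_accessed) / 3600.0
    return decay ** hours


def half_life_hours(decay: float = DECAY_PER_HOUR) -> float:
    """Hours until recency falls to 0.5: solve decay ** h == 0.5 for h."""
    return math.log(0.5) / math.log(decay)


for label, hours in [("1 hour", 1), ("1 day", 24), ("1 week", 168)]:
    print(f"{label:>7}: {recency(hours * 3600.0, 0.0):.3f}")
print(f"half-life: {half_life_hours():.1f} hours")
```

At this base an untouched memory loses half its recency in roughly 69 hours, which is a handy way to reason about the knob: pick the half-life you want and solve for the base.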
Importance is the memory's intrinsic poignancy, scored once at creation by the LLM on a 1-10 scale ("brushing teeth" is a 1; "got into an argument with a partner" is an 8). It is stored on the record and scaled at retrieval by dividing by 10, so a 1-10 rating maps to 0.1-1.0. This is what lets a meaningful event surface even when it is not the most relevant thing to the current query.
Relevance is the familiar signal: cosine similarity between the query embedding and the memory embedding. Because both vectors are L2-normalized when stored (the stream reuses the Embedder protocol and _normalize from agents_lab/memory.py), the dot product is the cosine.
importance = r.importance / 10.0
relevance = r.vec @ q # cosine, both vectors L2-normalized
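To see why the dot product is enough, here is a small check in plain Python (no NumPy, so it runs anywhere). The `l2_normalize` helper is a hypothetical stand-in for the lab's `_normalize`, whose actual implementation is not shown in this lesson, and the vectors are toy values rather than real embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Textbook cosine similarity on unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_memory = [3.0, 4.0, 0.0]
raw_query = [1.0, 2.0, 2.0]

# Normalize once at storage time, then retrieval needs only a dot product.
mem = l2_normalize(raw_memory)
q = l2_normalize(raw_query)

print(f"dot of normalized vectors: {dot(mem, q):.4f}")
print(f"cosine of raw vectors:     {cosine(raw_memory, raw_query):.4f}")
```

Both lines print the same value (11/15 for these vectors). Paying the normalization cost once at write time is what makes scoring the whole stream a single matrix-vector product.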
Combining the Signals: Min-Max, Then Weighted Sum
The three signals live on incomparable scales — recency is a decay factor, importance a normalized rating, relevance a cosine that for a real embedding model clusters in a narrow band well above zero. Summing them raw would let whichever signal happens to have the widest spread dominate. Park et al.'s fix is to min-max normalize each signal across the candidate set so each lands in [0, 1], then take a weighted sum.
def _minmax(x):
    # Rescale to [0, 1] across the candidate set; a constant array maps to all zeros.
    lo, hi = float(x.min()), float(x.max())
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

score = (
    w_recency * _minmax(recency)
    + w_importance * _minmax(importance)
    + w_relevance * _minmax(relevance)
)
order = np.argsort(score)[::-1][:k]  # indices of the k highest scores, best first
The default weights are all 1.0 — equal say to each signal. The paper found this simple equal weighting effective; tuning the weights tilts the agent's personality (crank w_recency and it lives in the moment; crank w_importance and it dwells on big events). The crucial subtlety is that min-max is computed relative to the current candidate pool: a memory's score is contextual, not absolute. The most relevant memory in this retrieval gets relevance 1.0 even if its raw cosine was middling.
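A toy run makes the contextual scoring and the weight "personality" concrete. This is a plain-Python sketch under stated assumptions: the three memories and their raw signal values are invented for illustration, and `minmax` and `rank` are hypothetical helpers that mirror the formula above rather than the lab's actual functions.

```python
def minmax(xs):
    """Rescale to [0, 1] across the candidate pool; constant input maps to zeros."""
    lo, hi = min(xs), max(xs)
    if hi <= lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def rank(memories, w_recency=1.0, w_importance=1.0, w_relevance=1.0):
    """Return (score, text) pairs, best first."""
    rec = minmax([m["recency"] for m in memories])
    imp = minmax([m["importance"] for m in memories])
    rel = minmax([m["relevance"] for m in memories])
    scored = [
        (w_recency * r + w_importance * i + w_relevance * s, m["text"])
        for r, i, s, m in zip(rec, imp, rel, memories)
    ]
    return sorted(scored, reverse=True)


# Invented raw signals. Note the narrow relevance band (0.60 to 0.70).
memories = [
    {"text": "argued with partner", "recency": 0.20, "importance": 0.8, "relevance": 0.62},
    {"text": "brushed teeth", "recency": 0.99, "importance": 0.1, "relevance": 0.60},
    {"text": "talked about the cat", "recency": 0.79, "importance": 0.3, "relevance": 0.70},
]

print("equal weights:    ", rank(memories)[0][1])
print("importance-heavy: ", rank(memories, w_importance=3.0)[0][1])
```

With equal weights the cat conversation wins, because its middling raw cosine of 0.70 is the best in this pool and is stretched to 1.0. Tripling `w_importance` hands the top spot to the argument instead.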
Retrieval Refreshes Recency
After picking the top-k, the stream stamps each returned record's last_accessed = now. This single line is what makes recency a feedback loop rather than a clock:
for h in hits:
    h.record.last_accessed = now  # recalling a memory keeps it warm
A memory that keeps getting retrieved keeps resetting its decay to 1.0; a memory that is never relevant slides toward zero recency and effectively drops out of competition. The agent's working set of "live" memories emerges from use, with no eviction policy needed.
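The feedback loop can be simulated without any embeddings. In the sketch below, `Record` is a hypothetical two-field stand-in for the lab's record type, the memory texts are made up, and the "retrieval" is reduced to the one side effect that matters here: stamping `last_accessed`.

```python
DECAY_PER_HOUR = 0.99
HOUR = 3600.0


class Record:
    """Toy record: just the text and the two timestamps (in seconds)."""

    def __init__(self, text, created_at):
        self.text = text
        self.created_at = created_at
        self.last_accessed = created_at


def recency(record, now):
    return DECAY_PER_HOUR ** ((now - record.last_accessed) / HOUR)


warm = Record("daily stand-up notes", created_at=0.0)
cold = Record("saw a poster in the hallway", created_at=0.0)

# Simulate a week in which `warm` is returned by a retrieval once a day
# and `cold` never is.
for day in range(1, 8):
    now = day * 24 * HOUR
    warm.last_accessed = now  # what the stream does to every returned hit

end_of_week = 7 * 24 * HOUR
print(f"warm: {recency(warm, end_of_week):.3f}")
print(f"cold: {recency(cold, end_of_week):.3f}")
```

Both records were created at the same instant, yet after a week the recalled one is at full recency while the ignored one has decayed to under a fifth. Creation time plays no part in the score.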
Reflection: Synthesizing Higher-Level Memories
Raw observations are episodic and literal. To act coherently over long horizons, an agent needs conclusions — "Klaus is passionate about his research", not the twenty separate observations that imply it. Reflection produces these.
Reflection is not on a timer; it fires on an importance budget. The stream accumulates the importance of every memory it stores, and when that running sum crosses a threshold, it reflects and resets the counter. Mundane days (low-importance observations) rarely trigger it; eventful ones do.
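REFLECT_THRESHOLD = 30.0  # sum of importance since last reflection

def maybe_reflect(self, *, now=None, max_insights=3):
    # Below the importance budget: nothing to do.
    if self._importance_since_reflect < REFLECT_THRESHOLD:
        return []
    # Only the 20 most recent records are summarized.
    recent = "\n".join(f"- {r.text}" for r in self._records[-20:])
    # `model` is the reflection LLM: the injected one, or the default fallback (setup elided here).
    msg = model.invoke([
        SystemMessage(f"From these memories, infer up to {max_insights} high-level insights. "
                      "One insight per line, no numbering."),
        HumanMessage(recent),
    ])
    # One insight per non-empty line, with bullet characters and whitespace stripped.
    insights = [ln.strip(" -*\t") for ln in str(msg.content).splitlines() if ln.strip()][:max_insights]
    for ins in insights:
        self.add(ins, now=now, kind="reflection")  # stored back as memories
    # Reset the budget last, after the insights have been added.
    self._importance_since_reflect = 0.0
    return insights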
Each insight is added through the normal add path, so it gets its own importance score, timestamps, and embedding, and is tagged kind="reflection". The consequence is recursive: because reflections live in the same stream, a later reflection can draw on earlier reflections, building a shallow hierarchy of abstraction. The paper's full version first asks the LLM for salient questions, retrieves per-question, then synthesizes; the lab keeps the structure but flattens the question-generation step for clarity.
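The budget trigger and the write-back can be exercised in isolation with a stubbed LLM. The sketch below is a deliberately reduced, hypothetical `TinyStream`, not the lab's `MemoryStream`: it has no embeddings or timestamps, takes importance as an explicit argument, and assigns every reflection a fixed importance of 5.0 (an assumption; the real stream scores insights through its normal add path).

```python
REFLECT_THRESHOLD = 30.0  # same value as the lesson's constant


class TinyStream:
    """Hypothetical stand-in for the stream: kind, text, and importance only."""

    def __init__(self, llm):
        self.llm = llm  # any callable mapping a prompt string to a reply string
        self.records = []  # list of (kind, text, importance) tuples
        self.importance_since_reflect = 0.0

    def add(self, text, importance, kind="observation"):
        self.records.append((kind, text, importance))
        self.importance_since_reflect += importance

    def maybe_reflect(self, max_insights=3):
        if self.importance_since_reflect < REFLECT_THRESHOLD:
            return []
        recent = "\n".join(f"- {text}" for _, text, _ in self.records[-20:])
        raw = self.llm(recent)
        # One insight per non-empty line, bullets and whitespace stripped.
        insights = [ln.strip(" -*\t") for ln in raw.splitlines() if ln.strip()]
        insights = insights[:max_insights]
        for ins in insights:
            self.add(ins, importance=5.0, kind="reflection")  # assumed fixed score
        self.importance_since_reflect = 0.0
        return insights


def stub_llm(prompt):
    # Canned reply with mixed bullet styles and a blank line, to exercise parsing.
    return "- Klaus is passionate about his research\n\n* Klaus works late often\n"


stream = TinyStream(llm=stub_llm)
for i in range(3):
    stream.add(f"Klaus spent evening {i} in the lab", importance=9.0)
before = stream.maybe_reflect()  # 27 < 30: budget not reached

stream.add("Klaus skipped dinner to finish a draft", importance=9.0)
after = stream.maybe_reflect()  # 36 >= 30: reflection fires

print("before:", before)
print("after: ", after)
```

After the second call the stream holds four observations plus two reflection records, and the counter is back at zero, so the next reflection needs a fresh 30 points of importance.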
Contrast: Why Not Just a Vector Store?
The VectorMemory in agents_lab/memory.py ranks purely by relevance — embed the query, return the nearest neighbors. That is the right tool for retrieval-augmented generation over a knowledge base, where every document is timeless and equally important. It is the wrong tool for a character, because it has no notion of time and no notion of significance:
- No recency. A vivid memory from a month ago ties with one from a minute ago if they are equally relevant. The agent cannot "move on."
- No importance. A trivial observation that happens to be lexically close to the query outranks a life-changing event that is phrased differently.
- No synthesis. It can only ever return raw stored text; it never forms generalizations, so the agent re-derives the same conclusions every time.
The memory stream is a strict superset: set w_recency = w_importance = 0, skip reflection, and you are back to a plain vector store, because min-max normalization preserves the relevance ordering. The two extra signals plus reflection are what turn an index into a memory architecture. See agent memory for where this sits in the broader loop and how embeddings power the relevance leg.
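The superset claim can be checked on toy numbers. The sketch below uses invented signal values and a hypothetical `stream_order` helper. It compares a pure nearest-neighbor ordering with the weighted-sum ordering when the recency and importance weights are zeroed.

```python
def minmax(xs):
    lo, hi = min(xs), max(xs)
    if hi <= lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def stream_order(relevance, recency, importance, w_recency, w_importance, w_relevance):
    """Indices of the memories, best first, under the weighted min-max score."""
    rec, imp, rel = minmax(recency), minmax(importance), minmax(relevance)
    scores = [
        w_recency * r + w_importance * i + w_relevance * s
        for r, i, s in zip(rec, imp, rel)
    ]
    return sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)


# Invented raw signals for four memories.
relevance = [0.61, 0.74, 0.58, 0.69]
recency = [0.9, 0.1, 0.5, 0.3]
importance = [0.2, 0.3, 0.9, 0.8]

# What a plain vector store would return: sort by raw relevance.
vector_store_order = sorted(range(4), key=lambda idx: relevance[idx], reverse=True)

relevance_only = stream_order(relevance, recency, importance, 0.0, 0.0, 1.0)
equal_weights = stream_order(relevance, recency, importance, 1.0, 1.0, 1.0)

print("vector store:  ", vector_store_order)
print("relevance only:", relevance_only)
print("equal weights: ", equal_weights)
```

The first two orderings match, since min-max is a monotone rescaling. Turning the other two weights back on reorders the list, which is exactly the behavior a plain index cannot express.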
Run it
The stream builds on agents_lab/memory.py: embeddings come from FastEmbed (bge-small, in-process ONNX), so retrieval itself adds no paid-API cost. Both the importance_fn and the reflection llm are injectable, so you can exercise the full mechanism offline. DeepSeek is the only paid API in the lab, and it is touched only if you let importance scoring or reflection fall back to the default LLM.
from agents_lab.memory_stream import MemoryStream
# importance_fn injected -> no LLM call for scoring; embeddings are local.
ms = MemoryStream(importance_fn=lambda t: 9.0 if "important" in t else 2.0)
t0 = 0.0
ms.add("important cat meeting notes", now=t0) # importance 9
ms.add("the office plant needs watering", now=t0) # importance 2
# Retrieve a day later: relevance favors "cat", importance favors the meeting,
# recency has decayed equally for both. Retrieval refreshes their last_accessed.
t1 = t0 + 24 * 3600
hits = ms.retrieve("cat", k=3, now=t1)
for h in hits:
    print(f"{h.score:.3f} {h.record.text}")
# Reflection only fires once accumulated importance crosses REFLECT_THRESHOLD (30).
# With an injected llm it runs fully offline; otherwise it falls back to DeepSeek.
insights = ms.maybe_reflect(now=t1)
print("insights:", insights)
To watch reflection actually trigger, add enough high-importance memories to push _importance_since_reflect past 30 (four important items at 9 each sum to 36), then call maybe_reflect. With a stubbed llm you can assert on the synthesized text; with the default, you will see DeepSeek-generated insights stored back as kind="reflection" records that then compete in the next retrieve. Start with agent memory for the conceptual map, then come here to feel the three signals trade off in code.