Beyond Chat History: A Multi-Layer Memory Stack for Dockerized AI Agents

Multi-Memory Harness Technique for AI LLM CLIs (codex, opencode, pi, etc.)

LLM context windows are temporary.
Agent memory needs layers.
Not all memory should be treated equally.
Curated facts and raw transcripts serve different purposes.
Local embeddings + pgvector let agents retrieve context without sending memory to external embedding APIs.
Metrics are necessary to prove memory is helping, not just accumulating noise.

PRIMARY CLAIM — Memory retrieval reduced repeated context reconstruction and likely reduced downstream LLM reasoning/search work.


Two Kinds of Memory

Two kinds of "long-term memory," each with its own strength and its own use.

1 Postgres PGVECTOR

Yes — pgvector long-term memory is working. Verified just now:
codex-memory exists: /usr/local/bin/codex-memory
Project profile resolves for /ws
Schema is ready: codex memory schema is ready
Search returned stored memories successfully
Services are healthy:
- codex-memory-db up/healthy
- codex-embeddings up/healthy
Database rows have embeddings with vector_dims = 384
One note: my first low-level SQL check used the wrong column name (project); actual schema uses project_name / project_key. Retried with the correct schema and confirmed stored embedded rows.

2 WIKI

Yes — memory-wiki is working for long-term memory. Verified just now:
memory-wiki CLI exists and is executable.
Long-running sidecar is up:
- codex-memory-wiki running memory-wiki worker
Dependencies are healthy:
- codex-memory-db healthy
- codex-embeddings healthy
memory-wiki status reports:
- memory_wiki_sources=116
- memory_wiki_chunks=2431
- data_home=/codex/memory-wiki-data
Search works and returned indexed transcript/wiki chunks with source JSONL paths and generated wiki Markdown paths.
One note: running docker compose run --rm memory-wiki status caused Compose to recreate/restart the db/embeddings containers, but they came back healthy immediately and search succeeded afterward.

How the memories are best used

› posit: pgv stores vectors and mwiki stores text

Basically yes, with one nuance:
pgvector / codex-memory
- Stores curated long-term facts/preferences/decisions.
- Stores the text content plus a vector embedding in Postgres.
- The vector is what enables semantic search.
- Think: “approved facts and preferences layer.”
memory-wiki
- Stores/indexes broader transcript/wiki history.
- Keeps human-readable text/wiki pages in local durable storage.
- Also stores chunk metadata and embeddings in pgvector-backed Postgres for search.
- Think: “durable dev/research logbook from sessions.”
So the more precise model is:

Both systems store text and vectors.
codex-memory is the curated fact layer.
memory-wiki is the larger transcript/wiki history layer.
pgvector is the vector-search storage engine underneath both.

| Layer | Purpose | Analogy |
| --- | --- | --- |
| pgv / codex-memory | Approved facts, preferences, decisions | “Working institutional memory” |
| mwiki | Durable transcripts and wiki history | “Long-term research archive” |
| embeddings service | Local semantic indexing | “Memory encoder” |
| pgvector/Postgres | Vector + metadata store | “Searchable memory database” |
| Codex agents | Consumers/producers | “Distributed workers with shared recall” |
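
To make "the vector is what enables semantic search" concrete, here is a minimal sketch in plain Python that ranks stored rows by cosine distance, the same ordering a cosine-distance query in pgvector would give. The `MemoryRow` type, the `search` helper, the toy 3-dimensional vectors, and the sample texts are all illustrative assumptions. The real stack stores 384-dimensional embeddings produced by the local embeddings service and ranks inside Postgres.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryRow:
    layer: str              # "codex-memory" (curated) or "memory-wiki" (transcripts)
    text: str               # both layers keep the text next to the vector
    embedding: list[float]  # toy 3-dim vectors here; the real rows use 384 dims


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(rows: list[MemoryRow], query_vec: list[float], limit: int = 2):
    """Return (distance, row) pairs, nearest first."""
    scored = [(cosine_distance(r.embedding, query_vec), r) for r in rows]
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]


# Hypothetical rows for illustration only, not real stored memories.
rows = [
    MemoryRow("codex-memory", "Prefer compose services over ad hoc containers", [0.9, 0.1, 0.0]),
    MemoryRow("memory-wiki", "Transcript chunk about the embeddings sidecar", [0.1, 0.9, 0.2]),
    MemoryRow("codex-memory", "Decision: keep embeddings local", [0.0, 0.2, 0.9]),
]

for distance, row in search(rows, query_vec=[1.0, 0.0, 0.0]):
    print(f"{distance:.3f}  [{row.layer}]  {row.text}")
```

Both layers go through the same ranking step; what differs is what gets written into each layer and how much it is trusted.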

## Plan: Memory Effectiveness + Retrieval Metrics

### 1. Define what “effective memory hit” means

Track each memory retrieval as one of:

Miss: no relevant memory found.
Weak hit: memory returned but not used.
Useful hit: memory affected the answer/plan.
Critical hit: memory prevented rework, corrected assumptions, restored prior context, or avoided asking the user.
Stale/bad hit: memory was retrieved but wrong/outdated/noisy.

For blog purposes, the key stat is not just “retrieved memories,” but:

“How often did memory reduce repeated explanation, investigation, or external/LLM context reconstruction?”
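
The five hit classes above can be sketched as a small enum plus a labeling function. This is only one possible encoding: the `HitClass` name, the `classify_hit` helper, and its boolean inputs (`used`, `prevented_rework`, `was_wrong`) are assumptions about what a reviewer or the agent would record after each retrieval.

```python
from enum import Enum


class HitClass(Enum):
    MISS = "miss"          # no relevant memory found
    WEAK = "weak"          # returned but not used
    USEFUL = "useful"      # affected the answer/plan
    CRITICAL = "critical"  # prevented rework or restored prior context
    STALE = "stale"        # retrieved but wrong/outdated/noisy


def classify_hit(result_count: int, used: bool,
                 prevented_rework: bool, was_wrong: bool) -> HitClass:
    """Map hypothetical per-retrieval flags onto the five classes."""
    if result_count == 0:
        return HitClass.MISS
    if was_wrong:
        # A wrong memory counts as stale/bad even if it was acted on.
        return HitClass.STALE
    if not used:
        return HitClass.WEAK
    return HitClass.CRITICAL if prevented_rework else HitClass.USEFUL


print(classify_hit(3, used=True, prevented_rework=True, was_wrong=False).value)
```

Checking `was_wrong` before `used` is a deliberate choice in this sketch: a bad memory that influenced the answer is arguably worse than one that was ignored, so it should not be counted as a useful hit.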

### 2. Track separate read/write counters

Track independently:

#### pgv / codex-memory

Search count
Search query text/hash
Result count
Top distance/similarity
Used/not used
Memory IDs used
Adds/writes
Updates/deletes if implemented
Scope: global/project
Kind: fact/preference/decision/gotcha/etc.

#### mwiki

Search count
Search query text/hash
Result count
Top distance/similarity
Source transcript/session IDs retrieved
Wiki pages opened/read
Chunks retrieved
Used/not used
New sources ingested
New chunks written
Generated wiki pages written
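
One way to keep the two layers' counters separate is a single counter keyed by (layer, event). The sketch below assumes hypothetical event names such as `search`, `used`, and `write`; it is not how either CLI tracks anything today.

```python
from collections import Counter

LAYERS = ("pgv", "mwiki")


class MemoryCounters:
    """Independent read/write counters per memory layer (illustrative only)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, layer: str, event: str, n: int = 1) -> None:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer!r}")
        self._counts[(layer, event)] += n

    def get(self, layer: str, event: str) -> int:
        return self._counts[(layer, event)]

    def used_rate(self, layer: str) -> float:
        """Share of searches whose results were actually used."""
        searches = self.get(layer, "search")
        return self.get(layer, "used") / searches if searches else 0.0


counters = MemoryCounters()
counters.record("pgv", "search", 4)
counters.record("pgv", "used", 3)
counters.record("mwiki", "search", 2)
counters.record("mwiki", "chunks_written", 83)
print(counters.used_rate("pgv"), counters.used_rate("mwiki"))
```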

### 3. Track “LLM hit reduction” proxies

Direct measurement of “LLM hits avoided” is hard, but good proxies are possible:

#### A. User-repeat avoidance

Count when memory avoids asking:

“Remind me what we decided?”
“Where is that file?”
“What was the architecture?”
“What did we try last time?”

Metric:

questions_avoided_by_memory

#### B. Re-investigation avoidance

Count when memory avoids filesystem/web/docker investigation.

Metric examples:

shell_commands_avoided_estimate
web_searches_avoided_estimate
files_reopened_avoided_estimate
prior_context_reconstructed_from_memory

#### C. Prompt/context compression

Estimate how much text was retrieved versus how much would have been needed in the prompt.

Metrics:

memory_tokens_retrieved
estimated_prompt_tokens_saved
transcript_tokens_not_reloaded

#### D. Task acceleration

Track time/turn reduction where possible.

Metrics:

turns_saved_estimate
minutes_saved_estimate
context_recovery_time_ms
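
The prompt/context compression proxy can be sketched as simple arithmetic: estimate the tokens in the chunks that were retrieved, estimate the tokens in the transcript that would otherwise have been reloaded, and take the difference. The four-characters-per-token ratio below is a rough rule of thumb assumed for illustration, not a real tokenizer, and the helper names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; assumes ~4 characters per token."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def compression_metrics(retrieved_chunks: list[str], full_transcript: str) -> dict:
    """Compare what memory returned with what a full reload would have cost."""
    memory_tokens_retrieved = sum(estimate_tokens(c) for c in retrieved_chunks)
    transcript_tokens_not_reloaded = estimate_tokens(full_transcript)
    return {
        "memory_tokens_retrieved": memory_tokens_retrieved,
        "transcript_tokens_not_reloaded": transcript_tokens_not_reloaded,
        # Never report negative savings if retrieval was larger than the source.
        "estimated_prompt_tokens_saved": max(
            0, transcript_tokens_not_reloaded - memory_tokens_retrieved
        ),
    }


print(compression_metrics(["y" * 400], "x" * 4000))
```

These numbers are estimates, so they are best reported as such and compared against themselves over time rather than read as exact savings.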

### 4. Add lightweight telemetry events

Create an append-only local JSONL log, probably ignored by git:

data/memory-metrics/events.jsonl

Example event types:

{"type":"pgv_search","ts":"…","query":"…","result_count":3,"top_distance":0.10}
{"type":"pgv_used","ts":"…","memory_ids":[1,6],"usefulness":"critical","tokens_saved_estimate":1200}
{"type":"mwiki_search","ts":"…","query":"…","result_count":3,"top_distance":0.22}
{"type":"mwiki_used","ts":"…","source_ids":[3437],"usefulness":"useful","turns_saved_estimate":2}
{"type":"pgv_write","ts":"…","kind":"preference","scope":"project"}
{"type":"mwiki_ingest","ts":"…","sources_added":4,"chunks_added":83}
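
A minimal sketch of the append-only logger behind such events, using only the Python standard library. The `log_event` function and its `log_path` parameter are hypothetical names; the default path mirrors the one suggested above, and the UTC ISO 8601 timestamp format is an assumption, since the examples leave `ts` unspecified.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

DEFAULT_LOG = Path("data/memory-metrics/events.jsonl")


def log_event(event_type: str, log_path: Path = DEFAULT_LOG, **fields) -> dict:
    """Append one telemetry event as a single JSON line and return it."""
    event = {
        "type": event_type,
        "ts": datetime.now(timezone.utc).isoformat(),  # assumed timestamp format
        **fields,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever appends, which keeps the log append-only.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event


if __name__ == "__main__":
    log_event("pgv_search", query="where is the compose file", result_count=3, top_distance=0.10)
    log_event("pgv_used", memory_ids=[1, 6], usefulness="critical", tokens_saved_estimate=1200)
```

One JSON object per line keeps writes cheap and lets the reporting step stream the file without loading everything into a database first.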

### 5. Build a simple reporting command

Add a script/CLI such as:

./memory-metrics.sh summary
./memory-metrics.sh last-7-days
./memory-metrics.sh blog-report

Report:

Total pgv searches
Total mwiki searches
Total writes
Useful hit rate
Critical hit rate
Stale/bad hit rate
Estimated prompt tokens saved
Estimated turns saved
Top projects using memory
Top recurring memory categories
Ratio of curated memory hits vs transcript/wiki hits
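
The core of such a report could look like the sketch below, written in Python rather than shell for readability. It assumes the event types and field names from the example events (`pgv_search`, `pgv_used`, `usefulness`, and so on). The function names are hypothetical, and dividing by total searches to get hit rates is one choice among several.

```python
import json
from collections import Counter
from pathlib import Path


def load_events(path: str) -> list[dict]:
    """Read a JSONL event log; a missing file yields an empty list."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(events: list[dict]) -> dict:
    """Aggregate a subset of the report fields listed above."""
    types = Counter(e.get("type") for e in events)
    used = [e for e in events if e.get("type") in ("pgv_used", "mwiki_used")]
    usefulness = Counter(e.get("usefulness") for e in used)
    searches = types["pgv_search"] + types["mwiki_search"]

    def rate(n: int) -> float:
        # Assumption: rates are expressed per search, not per retrieved row.
        return n / searches if searches else 0.0

    return {
        "pgv_searches": types["pgv_search"],
        "mwiki_searches": types["mwiki_search"],
        "writes": types["pgv_write"] + types["mwiki_ingest"],
        "useful_hit_rate": rate(usefulness["useful"]),
        "critical_hit_rate": rate(usefulness["critical"]),
        "estimated_prompt_tokens_saved": sum(e.get("tokens_saved_estimate", 0) for e in used),
        "estimated_turns_saved": sum(e.get("turns_saved_estimate", 0) for e in used),
    }


if __name__ == "__main__":
    print(json.dumps(summarize(load_events("data/memory-metrics/events.jsonl")), indent=2))
```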

### 6. Measure retrieval quality over time

Add manual or semi-automatic ratings:

0 = miss
1 = weak/noisy
2 = useful
3 = critical
-1 = stale/bad

Then report:

memory_hit_quality_avg
useful_or_better_rate
stale_hit_rate

This would make the system feel empirically grounded rather than hand-wavy.
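
As a sketch, the three summary numbers can be computed directly from a list of ratings on the scale above. The `quality_report` name is hypothetical. Counting a stale hit as -1 in the average, so that it acts as a penalty, is an assumption of this sketch; excluding stale hits from the average would be equally defensible.

```python
VALID_RATINGS = {-1, 0, 1, 2, 3}


def quality_report(ratings: list[int]) -> dict:
    """Summarize per-retrieval ratings (-1 stale, 0 miss, 1 weak, 2 useful, 3 critical)."""
    invalid = [r for r in ratings if r not in VALID_RATINGS]
    if invalid:
        raise ValueError(f"ratings outside the -1..3 scale: {invalid}")
    total = len(ratings)
    if total == 0:
        return {
            "memory_hit_quality_avg": 0.0,
            "useful_or_better_rate": 0.0,
            "stale_hit_rate": 0.0,
        }
    return {
        # Assumption: stale hits (-1) pull the average down as a penalty.
        "memory_hit_quality_avg": sum(ratings) / total,
        "useful_or_better_rate": sum(1 for r in ratings if r >= 2) / total,
        "stale_hit_rate": sum(1 for r in ratings if r == -1) / total,
    }


print(quality_report([3, 2, 0, -1]))
```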
