Beyond Chat History: A Multi-Layer Memory Stack for Dockerized AI Agents

Multi-Memory Harness Technique for AI LLM CLIs (codex, opencode, pi, etc.)

  • LLM context windows are temporary.
  • Agent memory needs layers.
  • Not all memory should be treated equally.
  • Curated facts and raw transcripts serve different purposes.
  • Local embeddings + pgvector let agents retrieve context without sending memory to external embedding APIs.
  • Metrics are necessary to prove memory is helping, not just accumulating noise.

PRIMARY CLAIM — Memory retrieval reduced repeated context reconstruction and likely reduced downstream LLM reasoning/search work.

Two Kinds of Memory

There are two kinds of long-term memory in this stack, each with its own strengths and uses.

1. Postgres pgvector

  • Yes — pgvector long-term memory is working. Verified just now:
  • codex-memory exists: /usr/local/bin/codex-memory
  • Project profile resolves for /ws
  • Schema is ready: codex memory schema is ready
  • Search returned stored memories successfully
  • Services are healthy:
    • codex-memory-db up/healthy
    • codex-embeddings up/healthy
  • Database rows have embeddings with vector_dims = 384

One note: my first low-level SQL check used the wrong column name (project); the actual schema uses project_name / project_key. I retried with the correct schema and confirmed stored embedded rows.

2. memory-wiki

  • Yes — memory-wiki is working for long-term memory. Verified just now:
  • memory-wiki CLI exists and is executable.
  • Long-running sidecar is up:
    • codex-memory-wiki running memory-wiki worker
  • Dependencies are healthy:
    • codex-memory-db healthy
    • codex-embeddings healthy
  • memory-wiki status reports:
    • memory_wiki_sources=116
    • memory_wiki_chunks=2431
    • data_home=/codex/memory-wiki-data
  • Search works and returned indexed transcript/wiki chunks with source JSONL paths and generated wiki Markdown paths.

One note: running docker compose run --rm memory-wiki status caused Compose to recreate/restart the db/embeddings containers, but they came back healthy immediately and search succeeded afterward.

How the memories are best used

› posit: pgv stores vectors and mwiki stores text

  • Basically yes, with one nuance:
  • pgvector / codex-memory
    • Stores curated long-term facts/preferences/decisions.
    • Stores the text content plus a vector embedding in Postgres.
    • The vector is what enables semantic search.
    • Think: “approved facts and preferences layer.”
  • memory-wiki
    • Stores/indexes broader transcript/wiki history.
    • Keeps human-readable text/wiki pages in local durable storage.
    • Also stores chunk metadata and embeddings in pgvector-backed Postgres for search.
    • Think: “durable dev/research logbook from sessions.”

So the more precise model is:

Both systems store text and vectors.
codex-memory is the curated fact layer.
memory-wiki is the larger transcript/wiki history layer.
pgvector is the vector-search storage engine underneath both.
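
To make "text plus vector" concrete, here is a minimal sketch of what a semantic lookup does conceptually. It is not the codex-memory or memory-wiki implementation: the rows, the `search` helper, and the 3-dimensional vectors are hypothetical stand-ins (the real rows reported above use 384 dimensions), and cosine distance is assumed as the metric because it is one of the distance functions pgvector supports.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; a smaller value means a closer match."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical rows: each keeps its readable text next to its embedding.
# Toy 3-dimensional vectors are used only to keep the example readable.
ROWS = [
    {"id": 1, "text": "Schema uses project_name / project_key", "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "text": "Embeddings are computed locally", "embedding": [0.1, 0.9, 0.1]},
    {"id": 3, "text": "Wiki pages live in durable local storage", "embedding": [0.0, 0.2, 0.9]},
]


def search(query_embedding, rows, limit=2):
    """Rank rows by distance to the query vector and return the closest ones."""
    scored = [(cosine_distance(query_embedding, row["embedding"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0])  # ascending distance: best match first
    return scored[:limit]


if __name__ == "__main__":
    for distance, row in search([1.0, 0.0, 0.0], ROWS):
        print(f"{distance:.3f}  {row['text']}")
```

The vector decides which rows come back, and the text is what the agent actually reads. That is why both layers need to store both.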


| Layer | Purpose | Analogy |
| --- | --- | --- |
| pgv / codex-memory | Approved facts, preferences, decisions | “Working institutional memory” |
| mwiki | Durable transcripts and wiki history | “Long-term research archive” |
| embeddings service | Local semantic indexing | “Memory encoder” |
| pgvector/Postgres | Vector + metadata store | “Searchable memory database” |
| Codex agents | Consumers/producers | “Distributed workers with shared recall” |


## Plan: Memory Effectiveness + Retrieval Metrics

### 1. Define what “effective memory hit” means

Track each memory retrieval as one of:

  • Miss: no relevant memory found.
  • Weak hit: memory returned but not used.
  • Useful hit: memory affected the answer/plan.
  • Critical hit: memory prevented rework, corrected assumptions, restored prior context, or avoided asking the user.
  • Stale/bad hit: memory was retrieved but wrong/outdated/noisy.

For blog purposes, the key stat is not just “retrieved memories,” but:

“How often did memory reduce repeated explanation, investigation, or external/LLM context reconstruction?”

### 2. Track separate read/write counters

Track independently:

#### pgv / codex-memory

  • Search count
  • Search query text/hash
  • Result count
  • Top distance/similarity
  • Used/not used
  • Memory IDs used
  • Adds/writes
  • Updates/deletes if implemented
  • Scope: global/project
  • Kind: fact/preference/decision/gotcha/etc.

#### mwiki

  • Search count
  • Search query text/hash
  • Result count
  • Top distance/similarity
  • Source transcript/session IDs retrieved
  • Wiki pages opened/read
  • Chunks retrieved
  • Used/not used
  • New sources ingested
  • New chunks written
  • Generated wiki pages written
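
A minimal sketch of keeping the two layers' counters separate, under stated assumptions: `LayerCounters`, `query_hash`, and the counter names are hypothetical and not part of codex-memory or memory-wiki. The query is stored as a short hash, which is one way to handle the "query text/hash" choice listed above when raw query text should stay out of the metrics.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field


def query_hash(query: str) -> str:
    """Short, stable fingerprint of a query so repeats can be counted without storing text."""
    normalized = query.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


@dataclass
class LayerCounters:
    """Independent read/write counters for one memory layer (hypothetical structure)."""
    layer: str
    counts: Counter = field(default_factory=Counter)

    def record_search(self, query, result_count, top_distance, used):
        self.counts["searches"] += 1
        self.counts["results"] += result_count
        if used:
            self.counts["used"] += 1
        # Return the per-search record so the caller can also log it.
        return {
            "layer": self.layer,
            "query_hash": query_hash(query),
            "result_count": result_count,
            "top_distance": top_distance,
            "used": used,
        }

    def record_write(self, count=1):
        self.counts["writes"] += count


pgv = LayerCounters("pgv")
mwiki = LayerCounters("mwiki")
pgv.record_search("What did we try last time?", result_count=3, top_distance=0.10, used=True)
mwiki.record_write(count=83)  # e.g. new chunks written during an ingest
```

Keeping one object per layer means a noisy transcript search cannot inflate the numbers for the curated layer.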

### 3. Track “LLM hit reduction” proxies

Direct measurement of “LLM hits avoided” is hard, but good proxies are possible:

#### A. User-repeat avoidance

Count when memory avoids asking:

  • “Remind me what we decided?”
  • “Where is that file?”
  • “What was the architecture?”
  • “What did we try last time?”

Metric:

questions_avoided_by_memory

#### B. Re-investigation avoidance

Count when memory avoids filesystem/web/docker investigation.

Metric examples:

  • shell_commands_avoided_estimate
  • web_searches_avoided_estimate
  • files_reopened_avoided_estimate
  • prior_context_reconstructed_from_memory

#### C. Prompt/context compression

Estimate how much text was retrieved versus how much would have been needed in the prompt.

Metrics:

  • memory_tokens_retrieved
  • estimated_prompt_tokens_saved
  • transcript_tokens_not_reloaded

#### D. Task acceleration

Track time/turn reduction where possible.

Metrics:

  • turns_saved_estimate
  • minutes_saved_estimate
  • context_recovery_time_ms
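
The prompt/context compression metrics can be approximated without a tokenizer. The sketch below rests on two assumptions that are not from the measured system: roughly four characters per token, and savings defined as the transcript tokens that were not reloaded minus the memory tokens that were retrieved. The helper names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-characters-per-token ratio is an assumption."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def prompt_savings(retrieved_text: str, full_context_text: str) -> dict:
    """Compare what memory returned with what would otherwise have been reloaded."""
    memory_tokens_retrieved = estimate_tokens(retrieved_text)
    transcript_tokens_not_reloaded = estimate_tokens(full_context_text)
    # Assumption: savings are the difference, floored at zero so that a large
    # retrieval never reports a negative saving.
    saved = max(0, transcript_tokens_not_reloaded - memory_tokens_retrieved)
    return {
        "memory_tokens_retrieved": memory_tokens_retrieved,
        "transcript_tokens_not_reloaded": transcript_tokens_not_reloaded,
        "estimated_prompt_tokens_saved": saved,
    }


if __name__ == "__main__":
    print(prompt_savings("x" * 400, "y" * 4000))
```

These numbers are estimates and should be labeled that way in any report. A real tokenizer for the model in use would tighten them.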

### 4. Add lightweight telemetry events

Create an append-only local JSONL log, probably ignored by git:

data/memory-metrics/events.jsonl

Example event types:

{"type":"pgv_search","ts":"…","query":"…","result_count":3,"top_distance":0.10}
{"type":"pgv_used","ts":"…","memory_ids":[1,6],"usefulness":"critical","tokens_saved_estimate":1200}
{"type":"mwiki_search","ts":"…","query":"…","result_count":3,"top_distance":0.22}
{"type":"mwiki_used","ts":"…","source_ids":[3437],"usefulness":"useful","turns_saved_estimate":2}
{"type":"pgv_write","ts":"…","kind":"preference","scope":"project"}
{"type":"mwiki_ingest","ts":"…","sources_added":4,"chunks_added":83}
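
Writing such events takes very little code. The following is a minimal sketch of an append-only JSONL writer, not an existing part of the stack: `log_event` is a hypothetical helper, and the UTC ISO 8601 timestamp format is an assumption, since the examples above leave `ts` unspecified.

```python
import json
import os
from datetime import datetime, timezone


def log_event(path, event_type, **fields):
    """Append one telemetry event as a single JSON line and return it."""
    event = {
        "type": event_type,
        # Assumption: UTC ISO 8601 timestamps.
        "ts": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Mode "a" only ever appends, so earlier events are never rewritten.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event


if __name__ == "__main__":
    log_path = "data/memory-metrics/events.jsonl"
    log_event(log_path, "pgv_search", query="…", result_count=3, top_distance=0.10)
    log_event(log_path, "pgv_write", kind="preference", scope="project")
```

One JSON object per line keeps the log easy to tail, grep, and aggregate later without loading the whole file.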

### 5. Build a simple reporting command

Add a script/CLI such as:

./memory-metrics.sh summary
./memory-metrics.sh last-7-days
./memory-metrics.sh blog-report

Report:

  • Total pgv searches
  • Total mwiki searches
  • Total writes
  • Useful hit rate
  • Critical hit rate
  • Stale/bad hit rate
  • Estimated prompt tokens saved
  • Estimated turns saved
  • Top projects using memory
  • Top recurring memory categories
  • Ratio of curated memory hits vs transcript/wiki hits
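
A rough sketch of the aggregation behind such a report, written in Python rather than shell for readability. It assumes the event shapes from the telemetry examples, assumes a stale hit is logged as a `*_used` event with `usefulness` set to `"stale"`, and assumes rates are computed per search. None of these choices is confirmed by the system itself, and `summarize` is a hypothetical name.

```python
import json
from collections import Counter

SEARCH_TYPES = ("pgv_search", "mwiki_search")
USED_TYPES = ("pgv_used", "mwiki_used")
WRITE_TYPES = ("pgv_write", "mwiki_ingest")


def summarize(lines):
    """Aggregate JSONL event lines into the headline report numbers."""
    by_type = Counter()
    by_usefulness = Counter()
    tokens_saved = 0
    turns_saved = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        by_type[event["type"]] += 1
        if event["type"] in USED_TYPES:
            by_usefulness[event.get("usefulness", "unrated")] += 1
            tokens_saved += event.get("tokens_saved_estimate", 0)
            turns_saved += event.get("turns_saved_estimate", 0)

    searches = sum(by_type[t] for t in SEARCH_TYPES)

    def rate(label):
        # Assumption: rates are per search, not per returned result.
        return by_usefulness[label] / searches if searches else 0.0

    return {
        "pgv_searches": by_type["pgv_search"],
        "mwiki_searches": by_type["mwiki_search"],
        "writes": sum(by_type[t] for t in WRITE_TYPES),
        "useful_hit_rate": rate("useful"),
        "critical_hit_rate": rate("critical"),
        "stale_hit_rate": rate("stale"),
        "estimated_prompt_tokens_saved": tokens_saved,
        "estimated_turns_saved": turns_saved,
        "curated_hits": by_type["pgv_used"],
        "wiki_hits": by_type["mwiki_used"],
    }
```

Because `summarize` accepts any iterable of lines, it can be fed an open file handle on data/memory-metrics/events.jsonl directly, or a filtered subset for a last-7-days view.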

### 6. Measure retrieval quality over time

Add manual or semi-automatic ratings:

0 = miss
1 = weak/noisy
2 = useful
3 = critical
-1 = stale/bad

Then report:

memory_hit_quality_avg
useful_or_better_rate
stale_hit_rate

This would make the system feel empirically grounded rather than hand-wavy.
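
As a sketch of how the three report metrics could be derived from the rating scale, assuming the average is taken over all ratings (so a stale hit at -1 pulls it down) and that "useful or better" means a rating of 2 or 3. `quality_report` is a hypothetical helper, not an existing command.

```python
from statistics import mean

VALID_RATINGS = {-1, 0, 1, 2, 3}


def quality_report(ratings):
    """Compute the three quality metrics from a list of per-retrieval ratings."""
    ratings = list(ratings)
    invalid = [r for r in ratings if r not in VALID_RATINGS]
    if invalid:
        raise ValueError(f"ratings outside the scale: {invalid}")
    if not ratings:
        return {
            "memory_hit_quality_avg": 0.0,
            "useful_or_better_rate": 0.0,
            "stale_hit_rate": 0.0,
        }
    total = len(ratings)
    return {
        # Assumption: stale hits (-1) are included and lower the average.
        "memory_hit_quality_avg": mean(ratings),
        "useful_or_better_rate": sum(1 for r in ratings if r >= 2) / total,
        "stale_hit_rate": sum(1 for r in ratings if r == -1) / total,
    }


if __name__ == "__main__":
    print(quality_report([0, 1, 2, 3, -1, 2]))
```

Running this over a sliding window of recent retrievals, rather than the full history, would show whether quality is trending up or down as the memory grows.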
