Multi-Memory Harness Technique for LLM CLI Agents (codex, opencode, pi, etc.)
- LLM context windows are temporary.
- Agent memory needs layers.
- Not all memory should be treated equally.
- Curated facts and raw transcripts serve different purposes.
- Local embeddings + pgvector let agents retrieve context without sending memory to external embedding APIs.
- Metrics are necessary to prove memory is helping, not just accumulating noise.
PRIMARY CLAIM — Memory retrieval reduced repeated context reconstruction and likely reduced downstream LLM reasoning/search work.
Two Kinds of Memory
Two kinds of long-term memory, each with its own strengths and its own uses.
1. Postgres pgvector (codex-memory)
- pgvector long-term memory is working. Checks run at the time of writing:
- codex-memory exists: /usr/local/bin/codex-memory
- Project profile resolves for /ws
- Schema is ready: codex memory schema is ready
- Search returned stored memories successfully
- Services are healthy:
- codex-memory-db up/healthy
- codex-embeddings up/healthy
- Database rows have embeddings with vector_dims = 384
One note: my first low-level SQL check used the wrong column name (project); the actual schema uses project_name / project_key. Retrying with the correct schema confirmed that embedded rows are stored.
2. memory-wiki
- memory-wiki is working for long-term memory. Checks run at the time of writing:
- memory-wiki CLI exists and is executable.
- Long-running sidecar is up:
- codex-memory-wiki running memory-wiki worker
- Dependencies are healthy:
- codex-memory-db healthy
- codex-embeddings healthy
- memory-wiki status reports:
- memory_wiki_sources=116
- memory_wiki_chunks=2431
- data_home=/codex/memory-wiki-data
- Search works and returned indexed transcript/wiki chunks with source JSONL paths and generated wiki Markdown paths.
One note: running docker compose run --rm memory-wiki status caused Compose to recreate/restart the db/embeddings containers, but they came back healthy immediately and search succeeded afterward.
How the memories are best used
Posit: pgv stores vectors and mwiki stores text.
- Basically yes, with one nuance:
- pgvector / codex-memory
- Stores curated long-term facts/preferences/decisions.
- Stores the text content plus a vector embedding in Postgres.
- The vector is what enables semantic search.
- Think: “approved facts and preferences layer.”
- memory-wiki
- Stores/indexes broader transcript/wiki history.
- Keeps human-readable text/wiki pages in local durable storage.
- Also stores chunk metadata and embeddings in pgvector-backed Postgres for search.
- Think: “durable dev/research logbook from sessions.”
Both systems store text and vectors.
codex-memory is the curated fact layer.
memory-wiki is the larger transcript/wiki history layer.
pgvector is the vector-search storage engine underneath both.
| Layer | Purpose | Analogy |
| --- | --- | --- |
| pgv / codex-memory | Approved facts, preferences, decisions | “Working institutional memory” |
| mwiki | Durable transcripts and wiki history | “Long-term research archive” |
| embeddings service | Local semantic indexing | “Memory encoder” |
| pgvector/Postgres | Vector + metadata store | “Searchable memory database” |
| Codex agents | Consumers/producers | “Distributed workers with shared recall” |
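To make the role of the vector concrete, here is a minimal sketch of nearest-neighbor search by cosine distance in plain Python. It only stands in for what pgvector does inside Postgres. The toy rows, the 3-dimensional vectors (the real embeddings have 384 dimensions), the function names, and the choice of cosine distance are all assumptions for illustration, not the actual codex-memory implementation.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 0.0 for identical direction, up to 2.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def search(rows, query_vec, limit=3):
    """Return (distance, content) for the `limit` rows nearest to query_vec."""
    scored = [(cosine_distance(r["embedding"], query_vec), r) for r in rows]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [(round(d, 4), r["content"]) for d, r in scored[:limit]]

# Hypothetical stored memories: text content plus its embedding.
rows = [
    {"content": "Prefer pnpm over npm", "embedding": [1.0, 0.0, 0.0]},
    {"content": "DB runs in docker compose", "embedding": [0.0, 1.0, 0.0]},
    {"content": "Use pnpm workspaces", "embedding": [0.9, 0.1, 0.0]},
]

results = search(rows, [1.0, 0.0, 0.0], limit=2)
for distance, content in results:
    print(distance, content)
```

The text is what gets returned to the agent; the vector is only used to rank rows. That is why both layers need to store both.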
## Plan: Memory Effectiveness + Retrieval Metrics
### 1. Define what “effective memory hit” means
Track each memory retrieval as one of:
- Miss: no relevant memory found.
- Weak hit: memory returned but not used.
- Useful hit: memory affected the answer/plan.
- Critical hit: memory prevented rework, corrected assumptions, restored prior context, or avoided asking the user.
- Stale/bad hit: memory was retrieved but wrong/outdated/noisy.
For blog purposes, the key stat is not just “retrieved memories,” but:
“How often did memory reduce repeated explanation, investigation, or external/LLM context reconstruction?”
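The five outcomes above could be encoded as a small enum with a classifier, sketched below. The name `classify_hit` and its boolean inputs are hypothetical. In practice the judgment "did this memory prevent rework" would come from the agent or a human reviewer, not from code.

```python
from enum import Enum

class HitClass(Enum):
    MISS = "miss"
    WEAK = "weak"
    USEFUL = "useful"
    CRITICAL = "critical"
    STALE = "stale"

def classify_hit(result_count, used, prevented_rework=False, was_wrong=False):
    """Map one retrieval to a hit class (assumed decision order)."""
    if result_count == 0:
        return HitClass.MISS          # nothing relevant found
    if was_wrong:
        return HitClass.STALE         # retrieved, but outdated or noisy
    if not used:
        return HitClass.WEAK          # returned but did not affect the answer
    # Used: critical if it prevented rework or a question to the user.
    return HitClass.CRITICAL if prevented_rework else HitClass.USEFUL

print(classify_hit(result_count=3, used=True, prevented_rework=True).value)
```

Checking `was_wrong` before `used` is a deliberate choice in this sketch: a stale memory that influenced the answer is counted as stale, not as useful.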
### 2. Track separate read/write counters
Track independently:
#### pgv / codex-memory
- Search count
- Search query text/hash
- Result count
- Top distance/similarity
- Used/not used
- Memory IDs used
- Adds/writes
- Updates/deletes if implemented
- Scope: global/project
- Kind: fact/preference/decision/gotcha/etc.
#### mwiki
- Search count
- Search query text/hash
- Result count
- Top distance/similarity
- Source transcript/session IDs retrieved
- Wiki pages opened/read
- Chunks retrieved
- Used/not used
- New sources ingested
- New chunks written
- Generated wiki pages written
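A minimal sketch of keeping the two layers' counters separate might look like the following. The class, field names, and the 16-character truncated SHA-256 query hash are assumptions. Hashing is one way to log queries without storing their raw text.

```python
import hashlib
from dataclasses import dataclass

def query_hash(query):
    """Short, stable fingerprint of a query (assumed: SHA-256, first 16 hex chars)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]

@dataclass
class LayerCounters:
    searches: int = 0
    results_returned: int = 0
    hits_used: int = 0
    writes: int = 0

    def record_search(self, result_count, used):
        self.searches += 1
        self.results_returned += result_count
        if used:
            self.hits_used += 1

    def record_write(self, count=1):
        self.writes += count

# One independent counter set per memory layer.
counters = {"pgv": LayerCounters(), "mwiki": LayerCounters()}
counters["pgv"].record_search(result_count=3, used=True)
counters["mwiki"].record_search(result_count=5, used=False)
counters["mwiki"].record_write(count=83)  # e.g. chunks written by one ingest

print(counters["pgv"], counters["mwiki"])
```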
### 3. Track “LLM hit reduction” proxies
Direct measurement of “LLM hits avoided” is hard, but good proxies are possible:
#### A. User-repeat avoidance
Count when memory avoids asking:
- “Remind me what we decided?”
- “Where is that file?”
- “What was the architecture?”
- “What did we try last time?”
Metric:
questions_avoided_by_memory
#### B. Re-investigation avoidance
Count when memory avoids filesystem/web/docker investigation.
Metric examples:
- shell_commands_avoided_estimate
- web_searches_avoided_estimate
- files_reopened_avoided_estimate
- prior_context_reconstructed_from_memory
#### C. Prompt/context compression
Estimate how much text was retrieved versus how much would have been needed in the prompt.
Metrics:
- memory_tokens_retrieved
- estimated_prompt_tokens_saved
- transcript_tokens_not_reloaded
#### D. Task acceleration
Track time/turn reduction where possible.
Metrics:
- turns_saved_estimate
- minutes_saved_estimate
- context_recovery_time_ms
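As a rough illustration of the prompt/context compression proxy, the sketch below estimates tokens saved as the difference between the transcript that would otherwise be reloaded and the memory chunks actually retrieved. The 4-characters-per-token heuristic and the function names are assumptions. A real tokenizer would give better numbers, and the result is an estimate, not a measurement.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumed heuristic: ~4 characters per token)."""
    return len(text) // chars_per_token

def prompt_tokens_saved(retrieved_chunks, full_transcript):
    """Compare retrieved memory size against reloading the whole transcript."""
    memory_tokens_retrieved = sum(estimate_tokens(c) for c in retrieved_chunks)
    transcript_tokens_not_reloaded = estimate_tokens(full_transcript)
    saved = max(0, transcript_tokens_not_reloaded - memory_tokens_retrieved)
    return {
        "memory_tokens_retrieved": memory_tokens_retrieved,
        "transcript_tokens_not_reloaded": transcript_tokens_not_reloaded,
        "estimated_prompt_tokens_saved": saved,
    }

# Toy inputs: one 400-character chunk versus a 4000-character transcript.
savings = prompt_tokens_saved(["a" * 400], "b" * 4000)
print(savings)
```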
### 4. Add lightweight telemetry events
Create an append-only local JSONL log, probably ignored by git:
data/memory-metrics/events.jsonl
Example event types:
{"type":"pgv_search","ts":"…","query":"…","result_count":3,"top_distance":0.10}
{"type":"pgv_used","ts":"…","memory_ids":[1,6],"usefulness":"critical","tokens_saved_estimate":1200}
{"type":"mwiki_search","ts":"…","query":"…","result_count":3,"top_distance":0.22}
{"type":"mwiki_used","ts":"…","source_ids":[3437],"usefulness":"useful","turns_saved_estimate":2}
{"type":"pgv_write","ts":"…","kind":"preference","scope":"project"}
{"type":"mwiki_ingest","ts":"…","sources_added":4,"chunks_added":83}
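One way to write such events is a small append-only logger, sketched here. The helper name `log_event`, the ISO 8601 UTC timestamp format, and the example query text are assumptions. The demo writes to a temporary directory rather than the real data/memory-metrics/events.jsonl.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def log_event(event_type, log_path="data/memory-metrics/events.jsonl", **fields):
    """Append one JSON object per line; never rewrites earlier events."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    event = {
        "type": event_type,
        "ts": datetime.now(timezone.utc).isoformat(),  # assumed timestamp format
        **fields,
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event

# Demo against a throwaway file so the real log is untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "events.jsonl"
    log_event("pgv_search", log_path=demo_log,
              query="hypothetical query", result_count=3, top_distance=0.10)
    log_event("pgv_used", log_path=demo_log,
              memory_ids=[1, 6], usefulness="critical", tokens_saved_estimate=1200)
    lines = demo_log.read_text(encoding="utf-8").splitlines()

print(len(lines), "events written")
```

JSONL keeps each write independent: a crash mid-session can at worst truncate the last line, and the log can be tailed or filtered with ordinary line tools.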
### 5. Build a simple reporting command
Add a script/CLI such as:
./memory-metrics.sh summary
./memory-metrics.sh last-7-days
./memory-metrics.sh blog-report
Report:
- Total pgv searches
- Total mwiki searches
- Total writes
- Useful hit rate
- Critical hit rate
- Stale/bad hit rate
- Estimated prompt tokens saved
- Estimated turns saved
- Top projects using memory
- Top recurring memory categories
- Ratio of curated memory hits vs transcript/wiki hits
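A first version of the summary could be a single pass over the logged events, as in this sketch. It assumes the event shapes from the examples in section 4, that a stale hit is logged with `usefulness` set to "stale", and that rates are computed per search. All three are illustrative choices, not settled design.

```python
from collections import Counter

def summarize(events):
    """Aggregate telemetry events into headline report numbers."""
    types = Counter(e["type"] for e in events)
    usefulness = Counter(
        e.get("usefulness") for e in events if e["type"].endswith("_used")
    )
    searches = types["pgv_search"] + types["mwiki_search"]

    def rate(label):
        # Assumed denominator: total searches across both layers.
        return usefulness[label] / searches if searches else 0.0

    return {
        "pgv_searches": types["pgv_search"],
        "mwiki_searches": types["mwiki_search"],
        "writes": types["pgv_write"] + types["mwiki_ingest"],
        "useful_hit_rate": rate("useful"),
        "critical_hit_rate": rate("critical"),
        "stale_hit_rate": rate("stale"),
        "prompt_tokens_saved": sum(e.get("tokens_saved_estimate", 0) for e in events),
        "turns_saved": sum(e.get("turns_saved_estimate", 0) for e in events),
    }

# Toy events mirroring the example event types.
events = [
    {"type": "pgv_search", "result_count": 3, "top_distance": 0.10},
    {"type": "pgv_used", "memory_ids": [1, 6], "usefulness": "critical",
     "tokens_saved_estimate": 1200},
    {"type": "mwiki_search", "result_count": 3, "top_distance": 0.22},
    {"type": "mwiki_used", "source_ids": [3437], "usefulness": "useful",
     "turns_saved_estimate": 2},
    {"type": "pgv_write", "kind": "preference", "scope": "project"},
    {"type": "mwiki_ingest", "sources_added": 4, "chunks_added": 83},
]
report = summarize(events)
print(report)
```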
### 6. Measure retrieval quality over time
Add manual or semi-automatic ratings:
0 = miss
1 = weak/noisy
2 = useful
3 = critical
-1 = stale/bad
Then report:
memory_hit_quality_avg
useful_or_better_rate
stale_hit_rate
This would make the system feel empirically grounded rather than hand-wavy.
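The three reported numbers can be computed directly from a list of ratings on the -1 to 3 scale, as sketched below. How stale hits should weigh on the average is an open choice. This sketch assumes they are simply included as -1, and the function name is hypothetical.

```python
def quality_report(ratings):
    """Summarize ratings on the scale -1 (stale/bad) to 3 (critical)."""
    if not ratings:
        return {
            "memory_hit_quality_avg": 0.0,
            "useful_or_better_rate": 0.0,
            "stale_hit_rate": 0.0,
        }
    n = len(ratings)
    return {
        # Assumption: stale hits count as -1 in the plain average.
        "memory_hit_quality_avg": sum(ratings) / n,
        "useful_or_better_rate": sum(1 for r in ratings if r >= 2) / n,
        "stale_hit_rate": sum(1 for r in ratings if r == -1) / n,
    }

# Toy ratings for eight retrievals.
quality = quality_report([3, 2, 2, 1, 0, -1, 3, 2])
print(quality)
```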