Building a RAG-Augmented Memory System for AI Coding Assistants
Every AI coding assistant starts each session with amnesia. It reads a few config files, maybe some curated memory notes, and then works as if the last hundred sessions never happened. Meanwhile, gigabytes of session histories, research findings, project records, and algorithm reflections sit unused on disk.
I built a RAG-augmented memory system that changes this. It adds a knowledge graph as an L2 cache behind the existing flat-file memory, transforming 1.9GB of write-only archives into queryable institutional memory. Here's how it works.
The Problem: 0.1% Knowledge Utilization
My AI assistant (Claude Code, running as PAI) had accumulated:
- 654MB of session history (102 session summaries + raw event logs)
- 1.2GB of session transcripts across 4,000+ JSONL files
- 11 PRDs (project requirement documents) from past work
- 16 algorithm reflections (self-improvement entries)
- 40KB of curated memory files
But each session only loaded ~2KB of that into context. The math was brutal:
1.9GB total data --> 40KB curated memory --> ~2KB loaded per session
                     (2% of data)            (0.1% of data)

The assistant would regularly re-research topics it had already investigated, unable to recall that it had spent 5 minutes on the exact same question two weeks ago.
The Architecture: RAG as L2 Cache
The key insight came from a first-principles analysis: the limiting assumption was that memory must be manually curated and loaded at session start. The existing memory files are great as an L1 cache — fast, curated, always available. But they were treated as the only memory.
The solution adds an L2 layer:
+-----------------------------------------------------+
| L1 (fast): MEMORY.md + memory files                 |
|   - Loaded at session start                         |
|   - 40KB curated, 200-line index                    |
|   - Manually maintained                             |
+-------------------------+---------------------------+
                          | falls through on miss
+-------------------------v---------------------------+
| L2 (deep): LightRAG Knowledge Graph                 |
|   - 2,200+ entity nodes, 1,900+ relationship edges  |
|   - Automatic ingestion via hooks                   |
|   - Semantic + graph-traversal search               |
|   - Queried on demand via REST API                  |
+-----------------------------------------------------+

The Stack: All Local, All Apple Silicon
Everything runs locally on a MacBook Pro with Apple Silicon — no cloud APIs needed for the RAG layer:

- omlx (port 8000) — MLX inference server serving both embeddings and LLM on Apple Silicon. Runs Qwen3-Embedding-4B-4bit-DWQ (2560-dim) for embeddings and Gemma 4 26B for entity extraction and query generation.
- LightRAG v1.4.13 (port 9621) — Graph-enhanced RAG engine in a Docker container. Builds a knowledge graph from ingested documents, enabling dual-level retrieval: vector similarity (local facts) + graph traversal (global relationships).
- RAGAnything v1.2.10 — Multimodal document processing layer on top of LightRAG. Handles PDFs, Office docs, images, tables, and equations via MinerU parser.
Phase 1: Seeding the Knowledge Graph
The first step was ingesting all existing PAI knowledge into LightRAG. I built a batch ingestion CLI tool (rag-seed.ts) that discovers and processes 8 data sources:
// Data sources for batch ingestion
const SOURCES = {
memory: "~/.claude/projects/.../memory/*.md", // 9 curated files
prds: "~/.claude/MEMORY/WORK/*/PRD.md", // 11 project records
reflections: "algorithm-reflections.jsonl", // 18 self-improvement entries
sessions: "~/.claude/History/Sessions/**/*.md", // 102 session summaries
paidocs: "~/.claude/PAI/*.md", // 13 system architecture docs
skills: "~/.claude/skills/*/SKILL.md", // 34 skill definitions
commands: "~/.claude/commands/*.md", // 4 slash commands
claudemds: "~/repos/*/CLAUDE.md", // 9 project instructions
};

Not everything should be ingested. A signal classifier gates content into three tiers:
type SignalTier = 'ingest' | 'summarize' | 'discard';
// Marker words that indicate high-signal content
const SIGNAL_MARKERS = [
"decided", "chose", "because", "error", "fix", "resolved",
"found that", "architecture", "deploy", "config", "pattern",
"constraint", "migration", "root cause", "workaround"
];
// Classification rules
// discard: <50 chars, OR 0 markers, OR greeting pattern
// summarize: 1-2 markers
// ingest: 3+ markers

The result: 153 documents ingested, producing 2,232 entity nodes and 1,994 relationship edges. The knowledge graph captures entities like "Mac Mini Server", "Gemma 4 26B", "Dan Elliott" and relationships like "omlx serves Qwen3-Embedding" and "LightRAG uses Docker via colima".
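The classification rules can be sketched as a small function. This is a minimal sketch, not the full rag-seed.ts classifier: the marker list is abbreviated and the greeting pattern is an assumed placeholder.

```typescript
// Minimal sketch of the three-tier signal classifier.
// Marker list abbreviated; GREETING is an assumed placeholder pattern.
type SignalTier = "ingest" | "summarize" | "discard";

const SIGNAL_MARKERS = [
  "decided", "chose", "because", "error", "fix", "resolved",
  "found that", "architecture", "root cause", "workaround",
];
const GREETING = /^(hi|hello|hey|thanks)\b/i;

function classifySignal(text: string): SignalTier {
  const lower = text.toLowerCase();
  const markers = SIGNAL_MARKERS.filter((m) => lower.includes(m)).length;
  // discard: too short, zero markers, or a greeting
  if (text.length < 50 || markers === 0 || GREETING.test(text.trim())) {
    return "discard";
  }
  // 3+ markers -> ingest verbatim; 1-2 -> summarize first
  return markers >= 3 ? "ingest" : "summarize";
}
```

Counting distinct markers rather than total occurrences keeps a single repeated word from promoting otherwise low-signal text.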
Phase 2: Continuous Ingestion Pipeline
Seeding is a one-time operation. To keep the graph current, I built three auto-ingestion hooks that fire on session end:
// rag-ingest-session.ts (Stop hook)
// Finds most recent session summary, classifies signal, fire-and-forget insert
const summary = findMostRecentSessionSummary();
if (isRecent(summary, 60)) { // within 60 seconds
const classification = classifySignal(readFileSync(summary, 'utf-8'));
if (classification.tier !== 'discard') {
client.insert(content, `Session summary: ${date}`)
.catch(() => {}); // fire-and-forget, never block session end
}
}

The three hooks cover:
- Session summaries — Stop hook, ingests the auto-generated session summary
- Completed PRDs — PostToolUse hook on Write/Edit, only fires when phase: complete in YAML frontmatter
- Algorithm reflections — Stop hook, checks if a new reflection was written in the last 120 seconds
All hooks use the fire-and-forget pattern: they call client.insert().catch(() => {}) and exit immediately without awaiting entity extraction. This keeps hook execution under 2 seconds and never blocks the session end.
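The client wrapper behind that pattern can be as small as a single fetch. A sketch follows; the /documents/text endpoint path and the { text, file_source } payload shape are assumptions about LightRAG's REST API (verify against your running version), and buildInsertPayload is a hypothetical helper, not part of the post's actual client.

```typescript
// Fire-and-forget insert (sketch). Endpoint path and payload field
// names are assumptions about LightRAG's REST API.
function buildInsertPayload(text: string, source: string) {
  return { text, file_source: source };
}

function insertDocument(
  text: string,
  source: string,
  base = "http://localhost:9621",
): Promise<void> {
  return fetch(`${base}/documents/text`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildInsertPayload(text, source)),
  }).then(() => undefined);
}

// Hooks invoke it without awaiting; errors are swallowed so a slow or
// unreachable RAG server can never block session teardown:
// insertDocument(summary, `Session summary: ${date}`).catch(() => {});
```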
Phase 3: Query Before Research
The highest-value feature: automatically querying the knowledge graph before launching research agents. This prevents the AI from re-researching topics it already knows about.
Three hooks inject prior knowledge as <system-reminder> context:
// rag-context-loader.ts (UserPromptSubmit hook)
// On the FIRST prompt of a session, query RAG for relevant context
const flagFile = `/tmp/rag-context-loaded-${process.ppid}.flag`;
if (existsSync(flagFile)) process.exit(0); // already loaded this session
writeFileSync(flagFile, ''); // prevent re-query on subsequent prompts
const response = await queryWithTimeout(prompt, 3000); // 3s budget
if (response.length > 100) {
console.log(`<system-reminder>
[RAG Prior Knowledge] ${response.slice(0, 1000)}
</system-reminder>`);
}The agent pre-query hook includes a critical infinite-recursion guard:
// Skip RAG-related agents to avoid infinite pre-query loops
const RAG_TERMS = ["rag", "lightrag", "raganything", "knowledge graph"];
if (RAG_TERMS.some(t => prompt.toLowerCase().includes(t))) {
process.exit(0); // don't pre-query for RAG-related agents
}

Phase 4: Temporal Decay
Without temporal awareness, a research finding from December 2025 scores identically to one from April 2026. The decay system applies exponential scoring based on content type:
// Scoring formula
const recencyScore = Math.exp(-0.693 * ageDays / halfLife); // 0.693 ≈ ln(2): score halves every halfLife days
const finalScore = 0.7 * semanticScore + 0.3 * recencyScore;
// Content-type-specific half-lives
const HALF_LIVES: Record<ContentType, number> = {
preference: 180, // user preferences change slowly
architecture: 90, // design decisions semi-stable
error: 60, // error patterns become irrelevant after fixes
research: 30, // research findings superseded fastest
general: 60, // default
};

The half-lives are externalized in a JSON config file, tunable without code changes:
{
"weights": { "semantic": 0.7, "recency": 0.3 },
"half_lives_days": {
"preference": 180, "architecture": 90,
"error": 60, "research": 30, "general": 60
},
"min_score_threshold": 0.1
}

The Results: 2,232 Nodes, 1,994 Edges
After seeding 153 documents (memory files, PRDs, session summaries, skill definitions, READMEs, system docs), the knowledge graph contains:
- 2,232 entity nodes — people, tools, hosts, models, concepts, projects
- 1,994 relationship edges — cross-document connections discovered automatically
- 98MB on disk (vector indices + graph + KV stores)
- 7 hooks registered across the full session lifecycle
Example query: "How is the local RAG setup?" returns a comprehensive answer drawing from the tools reference memory, the LightRAG setup PRD, the omlx configuration, and the system architecture docs — all synthesized through graph-enhanced retrieval.
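Issuing such a query over the REST API looks roughly like this. It's a sketch: the /query endpoint and the mode values follow LightRAG's API server as I understand it, but the exact field names and response shape ({ response: string } here) are assumptions to check against the running version.

```typescript
// Query the knowledge graph over LightRAG's REST API (sketch; field
// names and response shape are assumptions, verify against your server).
interface RagQuery {
  query: string;
  mode: "local" | "global" | "hybrid" | "naive";
}

function buildRagQuery(query: string, mode: RagQuery["mode"] = "hybrid"): RagQuery {
  return { query, mode };
}

async function ragQuery(q: RagQuery, base = "http://localhost:9621"): Promise<string> {
  const res = await fetch(`${base}/query`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(q),
  });
  if (!res.ok) throw new Error(`RAG query failed: ${res.status}`);
  const data = (await res.json()) as { response: string };
  return data.response;
}

// ragQuery(buildRagQuery("How is the local RAG setup?"));
```

Hybrid mode is the interesting one here: it combines vector similarity (local facts) with graph traversal (global relationships), which is what produces the cross-document synthesis described above.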
Key Gotcha: The /v1 Suffix
The most painful debugging was a 404 error on the embedding endpoint. When LightRAG runs in Docker and omlx runs on the host:
# WRONG - SDK appends /embeddings, yielding /embeddings with no /v1 prefix (404)
EMBEDDING_BINDING_HOST=http://host.docker.internal:8000
# RIGHT - SDK appends /embeddings to get /v1/embeddings
EMBEDDING_BINDING_HOST=http://host.docker.internal:8000/v1

The OpenAI Python SDK expects base_url to include /v1. Without it, the SDK calls /embeddings instead of /v1/embeddings, and omlx returns a 404. This cost two failed document insertions before diagnosis.
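The failure mode is easy to see without the SDK: an OpenAI-style client simply appends the route to whatever base URL it is given. This is a sketch of that joining behavior, not the SDK's actual code.

```typescript
// Sketch of how an OpenAI-compatible SDK derives the request URL:
// the route ("/embeddings") is appended to the configured base URL.
function embeddingsUrl(baseUrl: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/embeddings`;
}

// Without /v1, the versioned route is never hit and omlx answers 404:
embeddingsUrl("http://host.docker.internal:8000");
// -> "http://host.docker.internal:8000/embeddings"

// With /v1 in the base, the client produces the path omlx actually serves:
embeddingsUrl("http://host.docker.internal:8000/v1");
// -> "http://host.docker.internal:8000/v1/embeddings"
```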
What's Next
The system is production-ready for personal use. Future improvements include:
- Ingesting session transcripts — the 1.2GB of raw conversation data, filtered through the signal classifier with phase-boundary chunking
- Memory file auto-generation — using RAG queries to generate and refresh curated memory files (L1 becomes a materialized view of L2)
- Multi-agent knowledge sharing — subagents can curl http://localhost:9621/query to access the shared knowledge graph
Resources
- LightRAG — Graph-enhanced RAG framework (32k stars)
- RAGAnything — Multimodal document processing on LightRAG (15k stars)
- omlx — MLX inference server for Apple Silicon
- Qwen3-Embedding-4B MLX — The embedding model (2560-dim, 2.1GB)
- Claude Code — The CLI assistant this was built for
- LightRAG Paper (EMNLP 2025) — The academic paper behind the framework
- MemGPT Paper — Tiered memory architecture that inspired the L1/L2 design