Building a RAG-Augmented Memory System for AI Coding Assistants

Every AI coding assistant starts each session with amnesia. It reads a few config files, maybe some curated memory notes, and then works as if the last hundred sessions never happened. Meanwhile, gigabytes of session histories, research findings, project records, and algorithm reflections sit unused on disk.

I built a RAG-augmented memory system that changes this. It adds a knowledge graph as an L2 cache behind the existing flat-file memory, transforming 1.9GB of write-only archives into queryable institutional memory. Here's how it works.


The Problem: 0.1% Knowledge Utilization

My AI assistant (Claude Code, running as PAI) had accumulated:

  • 654MB of session history (102 session summaries + raw event logs)
  • 1.2GB of session transcripts across 4,000+ JSONL files
  • 11 PRDs (project requirement documents) from past work
  • 16 algorithm reflections (self-improvement entries)
  • 40KB of curated memory files

But each session only loaded ~2KB of that into context. The math was brutal:

1.9GB total data --> 40KB curated memory --> ~2KB loaded per session
                     (2% of data)            (0.1% of data)

The assistant would regularly re-research topics it had already investigated, unable to recall that it had spent 5 minutes on the exact same question two weeks ago.

The Architecture: RAG as L2 Cache

The key insight came from a first-principles analysis: the limiting assumption was that memory must be manually curated and loaded at session start. The existing memory files are great as an L1 cache — fast, curated, always available. But they were treated as the only memory.

The solution adds an L2 layer:

+-----------------------------------------------------+
|  L1 (fast): MEMORY.md + memory files                |
|  - Loaded at session start                           |
|  - 40KB curated, 200-line index                      |
|  - Manually maintained                               |
+-------------------------+---------------------------+
                          | falls through on miss
+-------------------------v---------------------------+
|  L2 (deep): LightRAG Knowledge Graph                |
|  - 2,200+ entity nodes, 1,900+ relationship edges   |
|  - Automatic ingestion via hooks                     |
|  - Semantic + graph-traversal search                 |
|  - Queried on demand via REST API                    |
+-----------------------------------------------------+
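
The cache analogy maps directly to code: check L1 first, and only on a miss pay the network round-trip to the graph. Here's a minimal sketch of that fallthrough, assuming LightRAG's REST `/query` endpoint on port 9621; the L1 path and line-matching strategy are illustrative, not the real hook's:

```typescript
import { existsSync, readFileSync } from "node:fs";

// L1: scan a curated memory file for lines mentioning the topic.
function queryL1(topic: string, path: string): string | null {
  if (!existsSync(path)) return null;
  const hits = readFileSync(path, "utf-8")
    .split("\n")
    .filter((line) => line.toLowerCase().includes(topic.toLowerCase()));
  return hits.length > 0 ? hits.join("\n") : null;
}

// L2: on an L1 miss, fall through to the knowledge graph over REST.
async function queryMemory(topic: string, l1Path: string): Promise<string> {
  const l1 = queryL1(topic, l1Path);
  if (l1 !== null) return l1; // L1 hit: no network round-trip
  const res = await fetch("http://localhost:9621/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: topic, mode: "hybrid" }),
  });
  const body = await res.json();
  return body.response ?? "";
}
```

An L1 hit costs a file read; only misses touch the network, which is what keeps the common case fast.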

The Stack: All Local, All Apple Silicon

Everything runs locally on a MacBook Pro with Apple Silicon — no cloud APIs needed for the RAG layer:

RAG Memory Architecture
  • omlx (port 8000) — MLX inference server serving both embeddings and LLM on Apple Silicon. Runs Qwen3-Embedding-4B-4bit-DWQ (2560-dim) for embeddings and Gemma 4 26B for entity extraction and query generation.
  • LightRAG v1.4.13 (port 9621) — Graph-enhanced RAG engine in a Docker container. Builds a knowledge graph from ingested documents, enabling dual-level retrieval: vector similarity (local facts) + graph traversal (global relationships).
  • RAGAnything v1.2.10 — Multimodal document processing layer on top of LightRAG. Handles PDFs, Office docs, images, tables, and equations via MinerU parser.

Phase 1: Seeding the Knowledge Graph

The first step was ingesting all existing PAI knowledge into LightRAG. I built a batch ingestion CLI tool (rag-seed.ts) that discovers and processes 8 data sources:

// Data sources for batch ingestion
const SOURCES = {
  memory: "~/.claude/projects/.../memory/*.md",      // 9 curated files
  prds: "~/.claude/MEMORY/WORK/*/PRD.md",             // 11 project records
  reflections: "algorithm-reflections.jsonl",          // 18 self-improvement entries
  sessions: "~/.claude/History/Sessions/**/*.md",      // 102 session summaries
  paidocs: "~/.claude/PAI/*.md",                       // 13 system architecture docs
  skills: "~/.claude/skills/*/SKILL.md",               // 34 skill definitions
  commands: "~/.claude/commands/*.md",                  // 4 slash commands
  claudemds: "~/repos/*/CLAUDE.md",                    // 9 project instructions
};

Not everything should be ingested. A signal classifier gates content into three tiers:

type SignalTier = 'ingest' | 'summarize' | 'discard';

// Marker words that indicate high-signal content
const SIGNAL_MARKERS = [
  "decided", "chose", "because", "error", "fix", "resolved",
  "found that", "architecture", "deploy", "config", "pattern",
  "constraint", "migration", "root cause", "workaround"
];

// Classification rules
// discard: <50 chars, OR 0 markers, OR greeting pattern
// summarize: 1-2 markers
// ingest: 3+ markers
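
The rules above translate into a few lines. A sketch of the classifier (the real rag-seed.ts internals may differ; the greeting pattern here is illustrative):

```typescript
type SignalTier = "ingest" | "summarize" | "discard";

const SIGNAL_MARKERS = [
  "decided", "chose", "because", "error", "fix", "resolved",
  "found that", "architecture", "deploy", "config", "pattern",
  "constraint", "migration", "root cause", "workaround",
];

function classifySignal(text: string): { tier: SignalTier; markers: number } {
  const lower = text.toLowerCase();
  const markers = SIGNAL_MARKERS.filter((m) => lower.includes(m)).length;
  const isGreeting = /^(hi|hello|hey|thanks)\b/i.test(text.trim());
  // discard: too short, zero markers, or a greeting
  if (text.length < 50 || markers === 0 || isGreeting) {
    return { tier: "discard", markers };
  }
  // 3+ markers: ingest whole; 1-2: summarize first
  return { tier: markers >= 3 ? "ingest" : "summarize", markers };
}
```

Substring matching is deliberately crude: false positives cost a little graph noise, while false negatives silently lose knowledge, so the classifier errs toward ingestion.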

The result: 153 documents ingested, producing 2,232 entity nodes and 1,994 relationship edges. The knowledge graph captures entities like "Mac Mini Server", "Gemma 4 26B", "Dan Elliott" and relationships like "omlx serves Qwen3-Embedding" and "LightRAG uses Docker via colima".

Phase 2: Continuous Ingestion Pipeline

Seeding is a one-time operation. To keep the graph current, I built three auto-ingestion hooks that fire on session end:

// rag-ingest-session.ts (Stop hook)
// Finds most recent session summary, classifies signal, fire-and-forget insert

const summary = findMostRecentSessionSummary();
if (summary && isRecent(summary, 60)) {  // written within the last 60 seconds
  const content = readFileSync(summary, 'utf-8');
  const classification = classifySignal(content);
  if (classification.tier !== 'discard') {
    client.insert(content, `Session summary: ${date}`)
      .catch(() => {});  // fire-and-forget, never block session end
  }
}

The three hooks cover:

  • Session summaries: Stop hook, ingests the auto-generated session summary
  • Completed PRDs: PostToolUse hook on Write/Edit, fires only when phase: complete appears in the YAML frontmatter
  • Algorithm reflections: Stop hook, checks whether a new reflection was written in the last 120 seconds

All hooks use the fire-and-forget pattern: they call client.insert().catch(() => {}) and exit immediately without awaiting entity extraction. This keeps hook execution under 2 seconds and never blocks the session end.

Phase 3: Query Before Research

The highest-value feature: automatically querying the knowledge graph before launching research agents. This prevents the AI from re-researching topics it already knows about.

Three hooks inject prior knowledge as <system-reminder> context:

// rag-context-loader.ts (UserPromptSubmit hook)
// On the FIRST prompt of a session, query RAG for relevant context

const flagFile = `/tmp/rag-context-loaded-${process.ppid}.flag`;
if (existsSync(flagFile)) process.exit(0);  // already loaded this session

writeFileSync(flagFile, '');  // prevent re-query on subsequent prompts

const response = await queryWithTimeout(prompt, 3000);  // 3s budget
if (response.length > 100) {
  console.log(`<system-reminder>
[RAG Prior Knowledge] ${response.slice(0, 1000)}
</system-reminder>`);
}
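
One plausible shape for the queryWithTimeout helper used above: race the L2 query against a timer and return a fallback on timeout, so a slow or offline RAG server can never stall prompt handling. The endpoint and payload shape are assumptions based on LightRAG's HTTP API:

```typescript
// Resolve with `fallback` if `work` doesn't settle within `ms`.
async function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // don't leave the timer holding the event loop open
  }
}

// The hook's 3-second budget, applied to a REST query against LightRAG.
function queryWithTimeout(prompt: string, budgetMs: number): Promise<string> {
  const query = fetch("http://localhost:9621/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: prompt, mode: "mix" }),
  })
    .then((r) => r.json())
    .then((j) => j.response ?? "");
  return withTimeout(query, budgetMs, ""); // empty string -> inject nothing
}
```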

The agent pre-query hook includes a critical infinite-recursion guard:

// Skip RAG-related agents to avoid infinite pre-query loops
const RAG_TERMS = ["rag", "lightrag", "raganything", "knowledge graph"];
if (RAG_TERMS.some(t => prompt.toLowerCase().includes(t))) {
  process.exit(0);  // don't pre-query for RAG-related agents
}

Phase 4: Temporal Decay

Without temporal awareness, a research finding from December 2025 scores identically to one from April 2026. The decay system applies exponential scoring based on content type:

// Scoring formula (0.693 ≈ ln 2, so recency halves every halfLife days)
const recencyScore = Math.exp(-0.693 * ageDays / halfLife);
const finalScore = 0.7 * semanticScore + 0.3 * recencyScore;

// Content-type-specific half-lives
const HALF_LIVES: Record<ContentType, number> = {
  preference: 180,    // user preferences change slowly
  architecture: 90,   // design decisions semi-stable
  error: 60,          // error patterns become irrelevant after fixes
  research: 30,       // research findings superseded fastest
  general: 60,        // default
};
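
To make the half-lives concrete, here is the formula as a runnable function with a few worked values (using Math.LN2 in place of the 0.693 constant):

```typescript
// recency = exp(-ln(2) * age / halfLife): halves every halfLife days.
function score(semanticScore: number, ageDays: number, halfLifeDays: number): number {
  const recency = Math.exp(-Math.LN2 * ageDays / halfLifeDays);
  return 0.7 * semanticScore + 0.3 * recency;
}

score(1.0, 0, 30);   // fresh research finding: recency 1.0, score ~1.0
score(1.0, 30, 30);  // one half-life old: 0.7 + 0.3 * 0.5 = 0.85
score(1.0, 180, 30); // six half-lives old: recency ~0.016, score ~0.705
```

Note the floor: because recency only carries 30% of the weight, even ancient content retains up to 0.7 of its semantic score, so old-but-exact matches still surface.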

The half-lives are externalized in a JSON config file, tunable without code changes:

{
  "weights": { "semantic": 0.7, "recency": 0.3 },
  "half_lives_days": {
    "preference": 180, "architecture": 90,
    "error": 60, "research": 30, "general": 60
  },
  "min_score_threshold": 0.1
}
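
Since a hand-edited tuning file is a new failure surface, the loader should be defensive. A sketch (file path and validation policy are illustrative): fall back to baked-in defaults when the JSON is missing, malformed, or has weights that don't sum to 1:

```typescript
interface DecayConfig {
  weights: { semantic: number; recency: number };
  half_lives_days: Record<string, number>;
  min_score_threshold: number;
}

const DEFAULTS: DecayConfig = {
  weights: { semantic: 0.7, recency: 0.3 },
  half_lives_days: { preference: 180, architecture: 90, error: 60, research: 30, general: 60 },
  min_score_threshold: 0.1,
};

function parseDecayConfig(raw: string | null): DecayConfig {
  if (!raw) return DEFAULTS;
  try {
    const parsed = JSON.parse(raw);
    // Weights must sum to 1, or final scores drift out of [0, 1].
    const { semantic, recency } = parsed.weights ?? {};
    if (typeof semantic !== "number" || typeof recency !== "number"
        || Math.abs(semantic + recency - 1) > 1e-9) {
      return DEFAULTS;
    }
    return { ...DEFAULTS, ...parsed }; // shallow merge over defaults
  } catch {
    return DEFAULTS; // malformed JSON: bad edit can't break retrieval
  }
}
```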

The Results: 2,232 Nodes, 1,994 Edges

After seeding 153 documents (memory files, PRDs, session summaries, skill definitions, READMEs, system docs), the knowledge graph contains:

  • 2,232 entity nodes — people, tools, hosts, models, concepts, projects
  • 1,994 relationship edges — cross-document connections discovered automatically
  • 98MB on disk (vector indices + graph + KV stores)
  • 7 hooks registered across the full session lifecycle

Example query: "How is the local RAG setup?" returns a comprehensive answer drawing from the tools reference memory, the LightRAG setup PRD, the omlx configuration, and the system architecture docs — all synthesized through graph-enhanced retrieval.

Key Gotcha: The /v1 Suffix

The most painful debugging was a 404 error on the embedding endpoint. When LightRAG runs in Docker and omlx runs on the host:

# WRONG - the SDK appends /embeddings, so requests hit /embeddings (no /v1 prefix)
EMBEDDING_BINDING_HOST=http://host.docker.internal:8000

# RIGHT - SDK appends /embeddings to get /v1/embeddings
EMBEDDING_BINDING_HOST=http://host.docker.internal:8000/v1

The OpenAI Python SDK expects base_url to include /v1. Without it, the SDK calls /embeddings instead of /v1/embeddings, and omlx returns a 404. This cost two failed document insertions before diagnosis.
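
The failure mode reduces to string concatenation: an OpenAI-compatible client joins its route onto whatever base URL you give it, and nothing inserts the /v1 for you. A simplified model of that join:

```typescript
// Simplified model of how an OpenAI-compatible client builds request URLs.
const join = (base: string, route: string) => base.replace(/\/+$/, "") + route;

join("http://host.docker.internal:8000", "/embeddings");
// -> "http://host.docker.internal:8000/embeddings"      (omlx: 404)
join("http://host.docker.internal:8000/v1", "/embeddings");
// -> "http://host.docker.internal:8000/v1/embeddings"   (correct)
```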

What's Next

The system is production-ready for personal use. Future improvements include:

  • Ingesting session transcripts — the 1.2GB of raw conversation data, filtered through the signal classifier with phase-boundary chunking
  • Memory file auto-generation — using RAG queries to generate and refresh curated memory files (L1 becomes a materialized view of L2)
  • Multi-agent knowledge sharing — subagents can curl http://localhost:9621/query to access the shared knowledge graph

Resources