GraphRAG for Production Engineer Agents

James Phoenix

Your agent’s reasoning is fine. Its memory isn’t. GraphRAG turns organizational knowledge into a connected graph that agents can traverse for incident response.


Source: Decoding AI Magazine – Anca Ioana Muscalagiu | Date: January 20, 2026


The Problem

Production incidents aren’t slowed by the lack of a fix. They’re slowed by the lack of clarity.

Context is scattered across Slack threads, Confluence pages, half-written runbooks, and the memories of engineers who may have left the company. When a pager fires at 02:13, engineers spend more time reconstructing context than actually resolving the issue.

The knowledge that holds systems together is relational: services depend on services, teams own systems, incidents recur in patterns. A flat vector store retrieves similar text but cannot express ownership chains, dependency graphs, or blast radius.

What Is GraphRAG?

GraphRAG is RAG where retrieval is guided by graph structure, not just similarity scores.

| Approach | Retrieval Method | Good For |
| --- | --- | --- |
| Traditional RAG | Semantically similar text chunks from vector DB | Point lookups, specific facts |
| GraphRAG | Connected knowledge via graph traversal | Coverage questions, dependency chains, synthesis across systems |

Traditional RAG answers: “What’s the most relevant chunk?”
GraphRAG answers: “What do we know about this issue across teams, services, and history?”

Two-Phase Architecture

Phase 1: Graph Generation (Offline)

  1. Source Documents to Text Chunks. Break runbooks, postmortems, architecture docs into indexable pieces.
  2. Text Chunks to Element Instances. Extract entities (services, teams, incidents) and relationships (DEPENDS_ON, OWNED_BY) from each chunk.
  3. Element Instances to Element Summaries. LLM generates concise summaries for each entity and relationship. These summaries become the node/edge properties and the basis for embeddings.
  4. Element Summaries to Graph Communities. Cluster the graph using hierarchical Leiden. Communities naturally align with operational boundaries (a platform area, a group of interdependent services).
  5. Graph Communities to Community Summaries. LLM summarizes each community. These become the primary retrieval units.
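Step 2 of this pipeline can be sketched in a few lines of Python. This is a minimal, hedged sketch: `extract_elements` is a stub standing in for a real LLM extraction call (a production pipeline would send each chunk to the model with a schema-constrained prompt and parse structured output), and the canned entities mirror the payments example later in the article.

```python
# Stub standing in for an LLM extraction call. A real pipeline would send the
# chunk plus a schema-constrained prompt to the model and parse its JSON reply.
def extract_elements(chunk: str) -> dict:
    # Hypothetical canned response for a payments runbook chunk.
    return {
        "entities": [
            {"type": "Service", "name": "payments-api"},
            {"type": "Team", "name": "Payments Platform"},
        ],
        "relationships": [
            {"src": "payments-api", "rel": "OWNED_BY", "dst": "Payments Platform"},
        ],
    }

def build_graph(chunks: list[str]) -> tuple[dict, list]:
    """Accumulate entities and relationships across chunks (Phase 1, step 2)."""
    nodes: dict[str, dict] = {}
    edges: list[dict] = []
    for chunk in chunks:
        elements = extract_elements(chunk)
        for ent in elements["entities"]:
            nodes.setdefault(ent["name"], ent)  # dedupe entities by name
        edges.extend(elements["relationships"])
    return nodes, edges

nodes, edges = build_graph(["Payments API 5xx spike after deploy..."])
```

The dedupe-by-name step matters: the same service appears in many chunks, and entity resolution is what turns isolated mentions into a connected graph.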

Phase 2: Query Answering (Runtime)

  1. Semantic entry point lookup. Use embeddings to find the most relevant nodes for the alert.
  2. Graph expansion. Traverse edges (DEPENDS_ON, OWNED_BY, HAS_RUNBOOK) to capture blast radius and context.
  3. Community-level synthesis. Identify relevant communities, generate intermediate answers from each, merge into a global response.
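The entry-point lookup in step 1 reduces to nearest-neighbor search over node embeddings. Below is a toy sketch using cosine similarity over hand-made 3-dimensional vectors; real node embeddings would come from an embedding model applied to each node's LLM-generated summary, and the search would run inside Neo4j's vector index rather than in Python.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 3-d vectors standing in for real summary embeddings on graph nodes.
node_embeddings = {
    "payments-api": [0.9, 0.1, 0.0],
    "auth-service": [0.2, 0.8, 0.1],
    "ledger-service": [0.1, 0.2, 0.9],
}

def entry_points(alert_embedding: list[float], k: int = 2) -> list[str]:
    """Rank nodes by similarity to the embedded alert text; top-k become traversal roots."""
    ranked = sorted(
        node_embeddings,
        key=lambda n: cosine(alert_embedding, node_embeddings[n]),
        reverse=True,
    )
    return ranked[:k]

print(entry_points([1.0, 0.0, 0.1]))  # payments-api ranks first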

Knowledge Graph Schema

Nodes:

| Type | Properties |
| --- | --- |
| Service | name, domain, tier, repo, tags, embedding |
| Team | name, oncall channel, owners, embedding |
| Incident | id, timestamp, severity, summary, embedding |
| Runbook | url, title, steps summary, embedding |
| Doc | source, url, title, embedding |
| Release/PR | id, timestamp, author, summary, embedding |

Relationships:

| Edge | Direction |
| --- | --- |
| DEPENDS_ON | Service -> Service |
| OWNED_BY | Service -> Team |
| AFFECTED | Incident -> Service |
| RESPONDED_BY | Incident -> Team |
| HAS_RUNBOOK | Service -> Runbook |
| DOCUMENTED_IN | Service/Incident -> Doc |
| RELATED_TO | Incident <-> Incident |
| INTRODUCED_BY | Incident/Service -> Release/PR |

Each node carries a vector embedding derived from its LLM-generated summary, not from raw documents.
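In application code this schema can be pinned down with plain dataclasses. The sketch below is illustrative only (the field names follow the tables above, but the exact types are assumptions); in the described stack these would map onto Neo4j node labels and relationship types.

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str
    domain: str
    tier: str
    repo: str
    tags: list[str] = field(default_factory=list)
    # Embedding derived from the node's LLM-generated summary, not raw documents.
    embedding: list[float] = field(default_factory=list)

@dataclass
class Edge:
    src: str
    rel: str  # e.g. "DEPENDS_ON", "OWNED_BY", "HAS_RUNBOOK"
    dst: str

svc = Service(name="payments-api", domain="payments", tier="1", repo="org/payments-api")
edge = Edge(src="payments-api", rel="OWNED_BY", dst="Payments Platform")
```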

Concrete Example: Payments API 5xx Spike

Input: A Confluence runbook about “Payments API 5xx spike after deploy.”

Extracted entities:

  • Service: payments-api, auth-service, ledger-service
  • Team: Payments Platform
  • Runbook: “Payments API 5xx spike after deploy”

Extracted relationships:

  • payments-api DEPENDS_ON auth-service
  • payments-api DEPENDS_ON ledger-service
  • payments-api OWNED_BY Payments Platform
  • payments-api HAS_RUNBOOK “Payments API 5xx spike after deploy”

At query time (alert: payments-api 5xx spike):

```cypher
MATCH (s:Service {name: "payments-api"})
OPTIONAL MATCH (s)-[:DEPENDS_ON]->(dep:Service)
OPTIONAL MATCH (s)-[:OWNED_BY]->(t:Team)
OPTIONAL MATCH (s)-[:HAS_RUNBOOK]->(r:Runbook)
RETURN s, collect(dep) AS dependencies, t AS owner, collect(r) AS runbooks
```

Bounded expansion for blast radius:

```cypher
MATCH (s:Service {name: "payments-api"})-[:DEPENDS_ON*1..2]->(dep:Service)
RETURN s, collect(DISTINCT dep) AS deps_2_hops
```
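What `DEPENDS_ON*1..2` does is a depth-bounded breadth-first expansion. Here is an in-memory equivalent over a toy adjacency map (the `user-db` node is invented here to show the second hop); Neo4j performs the same walk natively over stored edges.

```python
from collections import deque

# Toy dependency adjacency, extending the article's example with a
# hypothetical second-hop node (user-db).
depends_on = {
    "payments-api": ["auth-service", "ledger-service"],
    "auth-service": ["user-db"],
    "ledger-service": [],
    "user-db": [],
}

def blast_radius(root: str, max_hops: int = 2) -> set[str]:
    """Breadth-first expansion bounded at max_hops, like DEPENDS_ON*1..2."""
    seen: set[str] = set()
    frontier = deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                frontier.append((dep, depth + 1))
    return seen

print(blast_radius("payments-api"))  # three services within 2 hops
```

Bounding the expansion is the important design choice: unbounded traversal on a dense dependency graph would pull in half the organization and drown the LLM in context.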

This reconstructs a slice of the system: which services are involved, how far the blast radius extends, who owns what, and which operational knowledge applies.

System Architecture

Five components with clear boundaries:

  1. Alerting System. Prometheus detects threshold breaches, routes via Alertmanager to FastAPI webhook. Single entry point for all incidents.
  2. Agent Component. FastAPI server + Agent Controller. Orchestrates GraphRAG queries, MCP tool calls, and LLM inference. Custom explicit agent loop (no framework), making behavior predictable and debuggable.
  3. GraphRAG Component. Neo4j graph database with vector embeddings on nodes. Graph Query Engine performs semantic search + traversal.
  4. MCP Servers. Global MCP router forwards to specialized servers: Confluence (docs), GitHub (code changes), Slack (history + notifications), Prometheus (live metrics).
  5. Observability. Opik traces prompts, tool usage, and retrieval latency.

Data Flow

  1. Prometheus alert fires webhook to FastAPI
  2. Agent Controller queries GraphRAG for related services/teams
  3. Graph Query Engine: semantic search + edge traversal
  4. Agent sends tool plan to LLM (Gemini)
  5. MCP servers return live data (metrics, recent deploys, Slack discussions, docs)
  6. LLM synthesizes structured incident report
  7. Slack MCP posts to affected team channels

Steps 4-6 can loop as the LLM requests additional tool calls.
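The explicit loop in steps 4-6 can be sketched as follows. Everything here is a stand-in: `llm_step` fakes the model's decision (tool call vs. final answer) and `call_tool` fakes an MCP response; the point is the control flow, a bounded loop with no hidden framework machinery.

```python
# Stub LLM: first requests a metrics tool call, then emits a final report.
def llm_step(context: list[str]) -> dict:
    if not any(m.startswith("TOOL_RESULT") for m in context):
        return {"action": "tool", "tool": "prometheus_query", "args": {"q": "5xx_rate"}}
    return {"action": "final", "report": "payments-api 5xx spike; recent deploy suspected"}

def call_tool(tool: str, args: dict) -> str:
    # Stand-in for an MCP server call; returns canned data here.
    return f"TOOL_RESULT {tool}: 5xx_rate=0.12"

def agent_loop(alert: str, max_steps: int = 5) -> str:
    """Explicit controller loop: LLM decides, controller executes, context grows."""
    context = [f"ALERT {alert}"]
    for _ in range(max_steps):
        step = llm_step(context)
        if step["action"] == "final":
            return step["report"]
        context.append(call_tool(step["tool"], step["args"]))
    return "max steps reached"
```

The `max_steps` cap and the visible `context` list are what make the loop debuggable: every decision the model made, and every tool result it saw, sits in one inspectable place.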

Graph vs MCP: Data Priority

The graph holds structure and history. MCP holds what’s happening right now.

Priority order for conflicting information:

  1. MCP servers provide current state (deployments, metrics, discussions)
  2. Graph provides historical patterns and documented structure
  3. If they conflict, use MCP data and flag the discrepancy
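That priority rule is simple enough to encode directly. A minimal sketch (function and field names are assumptions, not from the article): prefer the live MCP value, fall back to the graph, and surface a flag whenever the two disagree.

```python
def resolve(field: str, mcp_value, graph_value):
    """Prefer live MCP data; flag a discrepancy when the graph disagrees."""
    if mcp_value is None:
        return graph_value, []  # no live signal: fall back to documented structure
    flags = []
    if graph_value is not None and mcp_value != graph_value:
        flags.append(f"{field}: MCP={mcp_value!r} conflicts with graph={graph_value!r}")
    return mcp_value, flags

owner, flags = resolve("owner", mcp_value="Payments Platform", graph_value="Core Payments")
```

Surfacing the flag rather than silently overwriting matters: a conflict between live state and the graph is itself a signal that the graph's nightly sync is stale or the documentation is wrong.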

Maintenance cadence: Build the graph once from existing docs, then run daily syncs. Production topology changes slowly enough that nightly updates suffice.

Tech Stack Choices

| Component | Choice | Rationale |
| --- | --- | --- |
| App server | FastAPI | Async by default, suits I/O-heavy workloads |
| Agent orchestration | Custom controller | Explicit loop, no hidden abstractions from frameworks |
| Graph DB | Neo4j | Native graph traversal + vector indexing on nodes |
| Retrieval | LlamaIndex PropertyGraph | Built-in agentic GraphRAG support |
| LLM | Gemini via gateway | Provider-agnostic abstraction layer |
| Observability | Opik | End-to-end trace logging for agent behavior |

Key design decision: custom agent controller over LangChain/LangGraph. Frameworks hide execution order and error handling behind abstractions that become liabilities in production. For incident response, behavior must be predictable and debuggable.

When to Use GraphRAG vs Vector RAG

Use GraphRAG when:

  • Questions are about coverage and synthesis, not similarity (“What do we know across teams?”)
  • The domain is inherently relational (services, dependencies, ownership)
  • You need to trace propagation paths (blast radius, dependency chains)
  • Context is fragmented across many systems

Use traditional RAG when:

  • Questions target specific facts in specific documents
  • Relationships between entities aren’t central to the answer
  • The corpus is flat and doesn’t have meaningful graph structure

Key Insight

The retrieval step becomes an act of structural reasoning over the organization itself. We are no longer pulling “relevant documents.” We are reconstructing a slice of the system.

