Frontmatter as Document Schema: Why Your Knowledge Base Needs Type Signatures

James Phoenix

Summary

Frontmatter is structured metadata at the top of a file that declares what a document is, what it contains, and how it should be discovered. In agent-driven systems, frontmatter serves the same role that type definitions serve in code: it is the specification layer that makes documents machine-addressable. This article explains frontmatter through two lenses: memory engineering (where frontmatter defines the data model for document retrieval) and spec-driven development (where frontmatter acts as the contract that constrains agent search space before a document body is ever read).

The Problem

Most knowledge bases are untyped. Files sit in folders with names like notes.md or ideas.md. The only way to know what a file contains is to open it and read the whole thing. Search is brute-force. Classification is guesswork.

This is the document equivalent of writing code without type signatures.

// Untyped code: what does this function return?
function processData(data) {
  // Read the entire implementation to find out
}

// Untyped document: what does this file contain?
# Some Notes
// Read the entire body to find out

For humans browsing a personal wiki, this is annoying but manageable. For AI agents retrieving knowledge across hundreds of files, it is a structural bottleneck. The agent must load entire documents into its context window just to determine relevance. Token budgets burn on files that turn out to be irrelevant. Signal drowns in noise.

The deeper problem: without metadata, retrieval systems depend entirely on body text embeddings. Vector search over raw content works for approximate matches, but it cannot filter by document type, validation level, difficulty, or recency without reading every file first.

The Solution

Frontmatter turns every file into a typed, queryable record.

---
title: "Agent Memory Patterns"
slug: "agent-memory-patterns"
taxonomy: "PATTERN"
tags: ["memory", "checkpoint", "state-persistence"]
categories: ["agent-architecture"]
difficulty: "ADVANCED"
createdDate: "2026-01-28"
---

This block, placed between --- delimiters at the top of a markdown file, declares the document’s interface before the body begins. Any system that reads the file (a static site generator, a note-taking app, a RAG pipeline, an MCP server) can parse these fields without touching the content.

The format is almost always YAML, though TOML (with +++ delimiters) and JSON are also supported by some tools. YAML dominates because it is readable, widely supported, and familiar to anyone who has written a CI config.
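The split itself is mechanical. Here is a minimal, stdlib-only sketch of the parse step; it handles only flat key: value pairs, so a real pipeline should use a proper YAML parser (PyYAML or the python-frontmatter package) instead:

```python
import re

def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a markdown file into (frontmatter dict, body).

    Simplified sketch: flat `key: value` pairs only. Lists, nesting,
    and YAML edge cases need a real YAML parser.
    """
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if not match:
        return {}, text  # no frontmatter block present
    meta = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return meta, match.group(2)

doc = '---\ntitle: "Agent Memory Patterns"\ntaxonomy: "PATTERN"\n---\nBody text here.'
meta, body = parse_frontmatter(doc)
# meta["taxonomy"] is "PATTERN"; the body was never inspected to learn that
```

The point of the sketch is the shape of the operation: metadata comes out of the file without reading the content, which is exactly what indexing systems rely on.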

Frontmatter Through the Memory Engineering Lens

Memory engineering treats agent memory as a data modelling problem. Every memory has four properties: schema, scope, TTL, and retrieval path. Frontmatter encodes all four at the document level.

| Memory Property | What It Answers | Frontmatter Field | Example |
|---|---|---|---|
| Schema | What shape is this? | taxonomy, type | taxonomy: "PATTERN" vs "CONCEPT" |
| Scope | Who does it apply to? | categories, difficulty | categories: ["agent-architecture"] |
| TTL | When does it expire? | createdDate, lastUpdated | Recency queries surface fresh content |
| Retrieval path | How does the agent find it? | title, tags, slug | BM25 + semantic search index these fields |

Without frontmatter, your knowledge base is an append-only notebook. Every document is equal. There is no way to distinguish a proven production pattern from a speculative idea without reading both in full.


With frontmatter, every document self-describes its role in the memory system. A validationLevel: "PROVEN" field means an agent can prioritize battle-tested patterns over speculation. A difficulty: "ADVANCED" field means a tutoring agent can filter content by the learner’s level. The documents become rows in a queryable store, not files in a folder.
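Once every document carries these fields, prioritizing and filtering become ordinary data access. A sketch over an in-memory index (the document dicts are invented examples, not a real corpus):

```python
# Hypothetical index: one parsed-frontmatter dict per document.
docs = [
    {"slug": "agent-memory-patterns", "taxonomy": "PATTERN",
     "validationLevel": "PROVEN", "difficulty": "ADVANCED"},
    {"slug": "scratch-ideas", "taxonomy": "CONCEPT",
     "validationLevel": "SPECULATIVE", "difficulty": "BEGINNER"},
]

# An agent that prefers battle-tested patterns filters before reading anything.
proven_patterns = [
    d for d in docs
    if d["taxonomy"] == "PATTERN" and d["validationLevel"] == "PROVEN"
]
```

The query is trivial on purpose: the hard part is not the filter, it is having the fields to filter on.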

The Lifecycle Connection

Memory engineering emphasizes that memories follow a lifecycle:

Create → Store → Retrieve → Validate → Expire/Promote

Frontmatter participates in every stage:

  • Create: Fields like createdDate timestamp when knowledge entered the system
  • Store: Tags and categories determine where the document sits in the knowledge graph
  • Retrieve: Title, tags, and taxonomy are indexed by search systems (BM25, vector, hybrid)
  • Validate: validationLevel tracks whether the knowledge has been proven or is still speculative
  • Expire/Promote: lastUpdated reveals staleness. A document last updated 18 months ago about “best practices for GPT-4” is probably outdated

Without these fields, staleness detection requires reading every document and making a judgment call. With them, a single query surfaces what needs review.
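Staleness detection reduces to a date comparison once lastUpdated exists. A sketch (the 365-day window is an example review policy, not a standard; field names follow the schema used in this article):

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=365)  # example review window

def needs_review(meta: dict, today: date) -> bool:
    """True when lastUpdated (falling back to createdDate) exceeds the window."""
    last = date.fromisoformat(meta.get("lastUpdated") or meta["createdDate"])
    return today - last > STALE_AFTER

# An 18-month-old document is flagged for review:
needs_review({"lastUpdated": "2024-09-01"}, today=date(2026, 3, 16))  # True
```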

Frontmatter Through the Spec-Driven Lens

Spec-driven development argues that specifications constrain the agent’s search space and produce better trajectories. The core insight: the more constraints you encode upfront, the smaller the space the agent needs to explore.

Frontmatter applies this same principle to knowledge retrieval.

A document without frontmatter is like a function without a type signature. The agent must read the entire implementation to understand what it does. A document with frontmatter is like a typed function. The agent knows the interface before touching the body.

Spec-driven code:     Types → Implementation → Tests
Spec-driven docs:     Frontmatter → Body → Links

Both follow the same inversion: declare the contract first, fill in the details second.

Frontmatter as Retrieval Specification

When you define frontmatter fields, you are writing a retrieval spec. You are declaring the axes along which a document should be discoverable.

---
title: "Agent Memory Patterns: Checkpoint, Resume, and State Persistence"
taxonomy: "PATTERN"
validationLevel: "PROVEN"
difficulty: "ADVANCED"
tags: ["agent-memory", "checkpoint-resume", "state-persistence"]
categories: ["architecture", "patterns", "agent-development"]
---

This spec tells any retrieval system:

  • What it is: a PATTERN (not a tutorial, not a concept explainer)
  • How reliable it is: PROVEN (tested in production, not just theorized)
  • What it covers: agent-memory, checkpoint-resume, state-persistence
  • Where it fits: architecture and agent-development
  • Who it’s for: advanced practitioners

Without this spec, retrieval relies entirely on body text embeddings. The search system cannot distinguish a beginner tutorial from an advanced pattern reference without reading both. With frontmatter, it can filter before ranking. This is the same principle as type-checking before running tests: cheaper verification first, expensive verification second.

Constraining the Search Space

The math mirrors spec-driven code generation. Better specs reduce the search space:

Without frontmatter:
  Agent query → scan all 200 files → rank by embedding similarity → hope top-5 are relevant

With frontmatter:
  Agent query → filter by taxonomy="PATTERN" → filter by tags ∩ query → rank filtered set
  Search space: 200 → 30 → 5 (97.5% reduction)

This is not a theoretical improvement. In practice, filtering by metadata fields before ranking by content similarity produces dramatically better retrieval precision. The agent gets the right document faster and spends fewer tokens on irrelevant context.
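A filter-then-rank retriever is only a few lines once the metadata index exists. In this sketch, `rank` stands in for any embedding-similarity scorer (assumption: it maps a document dict to a float; any vector-search backend could supply it), and the taxonomy filter is hard-coded to mirror the query above:

```python
def retrieve(query_tags: set, docs: list, rank, top_k: int = 5) -> list:
    """Filter by cheap metadata fields first, then rank only the survivors."""
    candidates = [
        d for d in docs
        if d["taxonomy"] == "PATTERN" and query_tags & set(d["tags"])
    ]
    # Expensive similarity scoring runs on the filtered set, not all files.
    return sorted(candidates, key=rank, reverse=True)[:top_k]
```

The cheap metadata predicates do the 200 → 30 narrowing; the scorer only ever sees the survivors.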

Progressive Disclosure Starts With Frontmatter

Frontmatter is Level 1 in the progressive disclosure pattern. The architecture has three layers:

Level 1: Frontmatter (always indexed, ~50-100 tokens per file)
   ↓ agent decides relevance
Level 2: Document body (loaded on demand, ~500-2000 tokens)
   ↓ agent follows references
Level 3: Linked resources (loaded as needed)

This is exactly how Claude Code skills work. SKILL.md files have YAML frontmatter with name, description, and triggers. The agent reads metadata at startup to build an index of available capabilities. When a user request matches a trigger, the agent loads the full skill body. Without frontmatter, the agent would need to load every skill file into context on every request.

The token economics are significant:

10 skills × 50 tokens metadata = 500 tokens always loaded
10 skills × 1500 tokens full   = 15,000 tokens always loaded

Savings: 96.7%

For a knowledge base with hundreds of documents, the difference between “index metadata” and “load everything” is the difference between a viable system and one that exhausts its context window before starting work.
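The savings figure above can be reproduced directly:

```python
n_skills, meta_tokens, full_tokens = 10, 50, 1500

always_loaded_metadata = n_skills * meta_tokens  # 500 tokens
always_loaded_full = n_skills * full_tokens      # 15,000 tokens

savings = 1 - always_loaded_metadata / always_loaded_full
print(f"{savings:.1%}")  # 96.7%
```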

Practical Schema Design

Minimum Viable Frontmatter

Three fields give you 80% of the value for a personal knowledge base:

---
title: "Descriptive Title That Conveys the Core Insight"
tags: ["topic-a", "topic-b"]
date: "2026-03-16"
---
  • title: searchable, displayable, unambiguous. A good title is a one-line summary.
  • tags: filterable axes for retrieval. Keep to 3-7 per document.
  • date: enables recency sorting and staleness detection.

This is the starting point. Add fields only when you have a retrieval need they serve.

Full Specification Frontmatter

For agent-indexed knowledge bases where retrieval precision matters:

---
title: "Pattern Name: What It Does in One Line"
slug: "pattern-name"
taxonomy: "PATTERN"
validationLevel: "PROVEN"
difficulty: "INTERMEDIATE"
readingTime: 8
tags: ["tag-a", "tag-b", "tag-c"]
categories: ["category-a", "category-b"]
createdDate: "2026-03-16"
lastUpdated: "2026-03-16"
---

Each field serves a specific retrieval function:

| Field | Retrieval Function |
|---|---|
| title | Display and full-text search |
| slug | URL generation and deduplication |
| taxonomy | Type filtering (PATTERN vs CONCEPT vs SOLUTION) |
| validationLevel | Confidence filtering (PROVEN vs SPECULATIVE) |
| difficulty | Audience matching |
| readingTime | Context budget estimation |
| tags | Topic-based retrieval |
| categories | Hierarchical classification |
| createdDate | Recency ranking |
| lastUpdated | Staleness detection |

Schema Consistency Matters

The single most important rule: use the same field names across every file.

# Bad: inconsistent across files
---
date: "2026-03-16"        # file A
createdDate: "2026-03-16"  # file B
created: "2026-03-16"      # file C
---

# Good: one schema, used everywhere
---
createdDate: "2026-03-16"  # every file
---

Inconsistent field names break queries. If half your files use date and half use createdDate, a filter on either field misses half your documents. Define the schema once. Follow it.
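Consistency is cheap to enforce mechanically. A sketch of a schema lint (the required field set is an example, not a standard; assume `all_meta` maps filenames to already-parsed frontmatter dicts):

```python
REQUIRED = {"title", "tags", "createdDate"}  # example schema

def lint_schema(all_meta: dict) -> list:
    """Report files whose frontmatter is missing required fields."""
    problems = []
    for name, meta in all_meta.items():
        missing = REQUIRED - meta.keys()
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
    return problems
```

Run in CI or a pre-commit hook, a check like this catches the date-vs-createdDate drift before it silently breaks queries.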

Where Frontmatter Lives in Practice

Static Site Generators

Hugo, Jekyll, Astro, Next.js MDX, and Gatsby all parse frontmatter to generate pages. The title becomes the page title. The slug becomes the URL. Tags generate taxonomy pages. This is the original use case and still the most common.

Note-Taking Apps

Obsidian reads frontmatter as “properties.” Tags in frontmatter appear in the tag pane. Custom fields power Dataview queries (e.g., “show all documents where taxonomy = PATTERN and validationLevel = PROVEN”). Logseq and Notion have similar metadata systems.

RAG Pipelines

Retrieval-Augmented Generation systems can parse frontmatter to create structured metadata alongside vector embeddings. This enables hybrid retrieval: filter by metadata fields first, then rank by semantic similarity within the filtered set.

Agent Skill Systems

Claude Code, Codex, and similar agent frameworks use frontmatter in skill files to enable progressive disclosure. The agent reads metadata to determine which skills are relevant, then loads full instructions only for the activated skill.

Common Mistakes

Inconsistent field names across files. If some files use date and others use createdDate, queries break. Define your schema once and stick to it.

Tags without taxonomy. Flat tag lists grow unbounded. After 200 documents, you have 150 unique tags and no structure. Use a controlled vocabulary or add a taxonomy field for coarse-grained classification alongside fine-grained tags.

Frontmatter without indexing. Frontmatter only works if something reads it. In Obsidian, this is automatic. For custom systems, you need to reindex after adding documents. Unindexed frontmatter is a spec nobody checks, which is the same as no spec at all.

Over-specifying. Ten frontmatter fields on a quick scratch note adds friction without value. Match schema complexity to document importance. Quick notes get three fields. Core reference documents get the full schema.

Treating frontmatter as content. Frontmatter is metadata, not a summary. It should contain classifiers and identifiers, not paragraphs of description. The summary belongs in the body, usually as the first section.

The Deeper Point

Frontmatter, type signatures, and specifications are all instances of the same principle: declare the contract before the implementation.

In code, types constrain the implementation search space. An agent generating code against a typed interface explores thousands of possibilities instead of millions.

In documents, frontmatter constrains the retrieval search space. An agent searching a knowledge base with rich metadata filters to dozens of candidates instead of hundreds.

In both cases, the upfront cost is small (a few lines of YAML, a few type definitions) and the downstream benefit compounds. Every new document with consistent frontmatter makes the entire knowledge base more queryable. Every new type definition makes the entire codebase more navigable.

The engineers who invest in frontmatter schemas are doing the same thing as the engineers who invest in type-driven development. They are writing specifications that make machines more effective at the expensive part (search, retrieval, implementation) by spending human effort on the cheap part (declaring intent).
