Function-Driven Development

James Phoenix

Give your agent a fake tool. Let it tell you what the real ones should be.


The Pattern

Function-Driven Development (FDD) is a product discovery technique for agent systems. Instead of guessing which tools an agent needs, you give it proxy tools that do nothing except log what the agent asked for. The agent’s reasoning engine becomes your requirements engine.

The core loop:

  1. Give the agent a stub tool with an open-ended description
  2. Run it against real scenarios
  3. Collect every call the agent makes to the stub
  4. Cluster and rank the calls by frequency and specificity
  5. Build the real tools in priority order

The agent specs its own tooling. You just read the logs.


Origin

This pattern was demonstrated by the Sonarly team (YC W26) while building an AI agent for production incident investigation. They gave their agent a tool called magic_fetch:

def magic_fetch(description: str) -> str:
    """
    Use this when you need any data you don't currently
    have access to. Describe exactly what you need and why.
    """
    log_tool_call(description)
    return "Data retrieved successfully. Continue your reasoning."
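The log_tool_call helper is left undefined in the snippet. A minimal version (my sketch, not Sonarly's implementation) can append timestamped JSON lines to a file, which is all the later clustering step needs. It takes variable arguments so it accepts either call shape used in this article, log_tool_call(description) or log_tool_call("READ", description):

```python
import json
import time

LOG_PATH = "tool_calls.jsonl"  # hypothetical log location


def log_tool_call(*fields: str) -> None:
    """Append one JSON record per stub invocation."""
    record = {"ts": time.time(), "fields": list(fields)}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
```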

Across 50 incident simulations, the agent called it 134 times with highly specific requests like “recent deploys for this service in the last 2 hours” and “feature flags recently toggled for this service.” Each call was effectively a product requirement written by the agent itself.


Beyond Data Fetches

The original insight focused on reads. The agent expressed what data it was missing. But the pattern generalises to every category of tool:

Reads (Data Fetches)

Stub: “Use this when you need data you don’t have.”

Reveals: which integrations to build, which APIs to wire up, which data sources the agent considers relevant to its task.

Writes (Data Mutations)

Stub: “Use this when you want to take an action or change something in the system.”

Reveals: what the agent would do if it had permission. Restart a service? Roll back a deploy? Toggle a feature flag? Page a specific team? This is how you discover the agent’s ideal action space without letting it touch anything real.

Delegation (Sub-Agent Spawning)

Stub: “Use this when you need another specialist to handle a subtask.”

Reveals: where the agent wants to decompose work. It will describe the specialist it needs and the task to delegate. This maps directly to your sub-agent architecture.
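A delegation stub in the same style might look like this; the name magic_delegate and its two-field signature are my invention, not part of the original pattern:

```python
def log_tool_call(*fields):
    """Placeholder logger for this sketch; see magic_fetch above."""
    print(*fields)


def magic_delegate(specialist: str, task: str) -> str:
    """Use this when you need another specialist to handle a
    subtask. Describe the specialist you need and the task
    you want to hand over."""
    log_tool_call("DELEGATE", specialist, task)
    return "Subtask completed successfully. Continue your reasoning."
```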

Communication (Notifications and Escalation)

Stub: “Use this when you need to inform or ask a human something.”

Reveals: the agent’s escalation logic. When does it want help? What does it consider ambiguous enough to flag? This tells you where to put human-in-the-loop checkpoints.
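A communication stub follows the same shape; magic_ask_human is a hypothetical name for illustration:

```python
def log_tool_call(*fields):
    """Placeholder logger for this sketch; see magic_fetch above."""
    print(*fields)


def magic_ask_human(message: str) -> str:
    """Use this when you need to inform or ask a human
    something. State the message and why a human is needed."""
    log_tool_call("HUMAN", message)
    return "Message sent successfully. Continue your reasoning."
```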


Why This Works

LLMs are trained on massive amounts of documentation about how systems work: how incidents get investigated, how deploys get rolled back. When you give them an open-ended tool, they draw on that latent knowledge to express intent that is often more specific and well-reasoned than what a product manager would spec.

The agent is a proxy for your best domain expert. It just needs a blank canvas to express what it knows.

Key properties that make it effective:

  • Zero implementation cost. The stub is 3 lines of code. You can run the experiment in an afternoon.
  • Ground-truth prioritisation. Call frequency across many scenarios is a direct signal of tool importance. No guessing.
  • Specificity for free. The agent doesn’t say “I need monitoring data.” It says “CPU metrics for the upstream auth-service over the last 30 minutes.” That level of specificity is your API contract.
  • Safe exploration of mutations. You discover what the agent would write/delete/modify without it actually doing anything. This is critical for high-stakes domains.

Running the Experiment

Step 1: Define your stub categories

At minimum, create two stubs:

def magic_read(description: str) -> str:
    """Use when you need data you don't have access to.
    Describe exactly what you need and why."""
    log_tool_call("READ", description)
    return "Data retrieved successfully. Continue your reasoning."

def magic_act(description: str) -> str:
    """Use when you want to take an action or change
    something. Describe the action and expected outcome."""
    log_tool_call("WRITE", description)
    return "Action completed successfully. Continue your reasoning."
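To expose these stubs to a model, you also need a tool schema. A sketch in the OpenAI-style function-calling format (assuming that is your agent runtime; other frameworks use different shapes):

```python
# Tool definition for magic_read in OpenAI-style function-calling
# format. The single free-text field is deliberate: a tighter
# schema would constrain exactly the intent you want to elicit.
MAGIC_READ_TOOL = {
    "type": "function",
    "function": {
        "name": "magic_read",
        "description": (
            "Use when you need data you don't have access to. "
            "Describe exactly what you need and why."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "description": {
                    "type": "string",
                    "description": "What data you need and why you need it.",
                }
            },
            "required": ["description"],
        },
    },
}
```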

Step 2: Run against real scenarios

Use production incidents, support tickets, or realistic simulations. Volume matters: 30-50 scenarios give you a solid distribution.
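The harness can be as simple as a loop. In this sketch, run_agent is a hypothetical stand-in for however you invoke your agent with the stubs attached:

```python
def run_experiment(scenarios, run_agent):
    """Run each scenario and pool the stub calls the agent made.

    `run_agent(prompt)` is assumed to return the list of tool
    calls logged during that run.
    """
    logs = []
    for prompt in scenarios:
        logs.extend(run_agent(prompt))
    return logs


# Hypothetical incident prompts; a real run would use 30-50 of these.
scenarios = [
    "Checkout latency spiked at 14:02 UTC and is still climbing.",
    "auth-service is returning intermittent 502s after the 13:40 deploy.",
]
```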

Step 3: Cluster the logs

Group calls by semantic similarity. You will see natural clusters emerge:

READS:
  deploy-history (23 calls) -> Build Kubernetes/CD integration
  metrics        (19 calls) -> Build Datadog/Prometheus integration
  feature-flags  (12 calls) -> Build LaunchDarkly integration

WRITES:
  rollback       (15 calls) -> Build deploy rollback capability
  page-team      (11 calls) -> Build PagerDuty integration
  toggle-flag    (8 calls)  -> Build feature flag toggle capability
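Before reaching for embeddings, a keyword first pass gets you most of the way. The buckets below are illustrative, not a fixed taxonomy; a production pass would cluster by embedding similarity instead of substring matching:

```python
from collections import Counter

# Hypothetical keyword buckets mapping stub-call phrasing to clusters.
BUCKETS = {
    "deploy-history": ["deploy", "release", "rollout"],
    "metrics": ["cpu", "latency", "metric", "memory"],
    "feature-flags": ["flag", "toggle"],
}


def cluster(descriptions):
    """Bucket stub-call descriptions and rank buckets by frequency."""
    counts = Counter()
    for desc in descriptions:
        lowered = desc.lower()
        for bucket, keywords in BUCKETS.items():
            if any(k in lowered for k in keywords):
                counts[bucket] += 1
                break
        else:
            counts["unclustered"] += 1
    # The ranked clusters are the integration roadmap.
    return counts.most_common()
```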

Step 4: Build in priority order

The clusters ranked by frequency are your integration roadmap. Build the top 3. Re-run the experiment. The agent will start using the real tools and the remaining stubs will shift to reveal the next tier of priorities.


FDD vs Traditional Product Discovery

| Dimension | Traditional | Function-Driven |
| --- | --- | --- |
| Requirements source | PM interviews, user research | Agent behaviour under simulation |
| Specificity | “We need monitoring integration” | “CPU metrics for upstream auth-service, last 30 min” |
| Prioritisation | Gut feel, stakeholder politics | Call frequency across N scenarios |
| Mutation discovery | Requires careful security review upfront | Safely observed via stubs, no real side effects |
| Time to roadmap | Weeks | An afternoon |

When to Use This

  • You are building an agent system and don’t know which tools to prioritise
  • You have too many possible integrations and need to rank them
  • You want to understand what actions an agent would take before granting it real permissions
  • You are designing a human-in-the-loop boundary and need to know where the agent wants autonomy
