Structured Plan Output and Plan Validation

A 'plan' that is just a paragraph of text is not a plan; it is a hope. Learn the schema GH-600 expects for a structured plan, the validations you can apply to it, and the failure modes a structured plan lets you detect.

After this topic, you'll be confident about four concepts: structured plan, plan schema, plan validation, and step ledger.

A plan is a contract, not a paragraph. The exam expects you to recognise a well-formed plan, name the validations applied to it, and explain why structured plans unlock everything else (eval, traceability, replay).

The minimum plan schema

{
  "plan_id": "abc-123",
  "goal": "Triage and label issue #4821",
  "steps": [
    {
      "id": "s1",
      "tool": "github.issues.get",
      "args": { "issue_number": 4821 },
      "precondition": "issue exists and is open",
      "postcondition": "issue body fetched into working memory"
    },
    {
      "id": "s2",
      "tool": "github.issues.addLabels",
      "args": { "issue_number": 4821, "labels": ["bug", "priority:high"] },
      "precondition": "s1 succeeded; classification = bug/high",
      "postcondition": "labels visible on issue"
    }
  ],
  "budget": { "max_steps": 8, "max_tokens": 50000, "wall_seconds": 60 }
}

Notice what is not in the schema: free-form prose, self-rated confidence, hidden tools. Anything not in the schema is rejected at the boundary.
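Rejecting "anything not in the schema" can be sketched as a small validator. This is a minimal illustration, not the runtime GH-600 describes: the field tables simply mirror the example plan above, and the names `check_object` and `validate_plan_schema` are hypothetical.

```python
from typing import Any

# Assumption: field names and types are copied from the example plan above.
PLAN_FIELDS = {"plan_id": str, "goal": str, "steps": list, "budget": dict}
STEP_FIELDS = {"id": str, "tool": str, "args": dict,
               "precondition": str, "postcondition": str}
BUDGET_FIELDS = {"max_steps": int, "max_tokens": int, "wall_seconds": int}


def check_object(obj: Any, fields: dict[str, type], where: str) -> list[str]:
    """Return schema errors for one object; an empty list means it is valid."""
    if not isinstance(obj, dict):
        return [f"{where}: expected an object"]
    errors = []
    for name, expected in fields.items():
        if name not in obj:
            errors.append(f"{where}: missing field '{name}'")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(obj[name], bool) or not isinstance(obj[name], expected):
            errors.append(f"{where}: field '{name}' has the wrong type")
    # Anything the schema does not name is rejected at the boundary.
    for name in sorted(obj.keys() - fields.keys()):
        errors.append(f"{where}: unexpected field '{name}'")
    return errors


def validate_plan_schema(plan: Any) -> list[str]:
    """Check the plan envelope first, then its budget and each step."""
    errors = check_object(plan, PLAN_FIELDS, "plan")
    if errors:
        return errors  # nested checks would be unreliable on a broken envelope
    errors += check_object(plan["budget"], BUDGET_FIELDS, "budget")
    for i, step in enumerate(plan["steps"]):
        errors += check_object(step, STEP_FIELDS, f"steps[{i}]")
    return errors
```

A step that adds a `confidence` field or omits `postcondition` comes back with one error per problem, so the plan can be refused before any step executes.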

The validations the runtime applies

Validation	What it catches
Schema validity	Malformed plans (missing fields, wrong types).
Tool allow-list	The agent picked a tool it does not have permission for.
Argument policy	Disallowed arguments (e.g., `force=true` is never allowed) and resource IDs outside the permitted scope.
Budget check	Plan exceeds step / token / time budget before it even starts.
Blast-radius estimate	High-blast-radius plans get escalated to human approval.
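The policy checks in the table can be sketched as a single review pass over an already schema-valid plan. Everything configurable here is an assumption made for illustration: the allow-list, the in-scope issue set, the per-tool blast weights, and the approval threshold are hypothetical values, and only the step budget is checked because token and time budgets would need estimates this sketch does not attempt.

```python
# Hypothetical policy configuration; real values would come from the runtime.
ALLOWED_TOOLS = {"github.issues.get", "github.issues.addLabels"}
IN_SCOPE_ISSUES = {4821}
BLAST_WEIGHT = {"github.issues.get": 0, "github.issues.addLabels": 1}
APPROVAL_THRESHOLD = 2


def review_plan(plan: dict) -> dict:
    """Apply allow-list, argument-policy, budget, and blast-radius checks."""
    violations = []
    for step in plan["steps"]:
        sid, tool, args = step["id"], step["tool"], step["args"]
        # Tool allow-list: the agent may only name tools it is permitted to use.
        if tool not in ALLOWED_TOOLS:
            violations.append(f"{sid}: tool '{tool}' is not allow-listed")
        # Argument policy: force=true is never allowed.
        if args.get("force") is True:
            violations.append(f"{sid}: force=true is never allowed")
        # Argument policy: resource IDs must be within scope.
        if "issue_number" in args and args["issue_number"] not in IN_SCOPE_ISSUES:
            violations.append(f"{sid}: issue {args['issue_number']} is out of scope")
    # Budget check (step count only in this sketch).
    if len(plan["steps"]) > plan["budget"]["max_steps"]:
        violations.append("plan has more steps than budget.max_steps")
    # Blast radius: unknown tools count as the full threshold, forcing escalation.
    blast = sum(BLAST_WEIGHT.get(s["tool"], APPROVAL_THRESHOLD) for s in plan["steps"])
    return {"violations": violations,
            "needs_human_approval": blast >= APPROVAL_THRESHOLD}
```

Returning violations and the escalation flag separately keeps "reject outright" distinct from "valid, but a human must approve first".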

Plan-vs-execution divergence

Because the plan is a contract, the runtime can detect when execution diverges from it. If a step (say, s3) tries to call a tool not listed in the plan, the runtime blocks the call and logs it rather than tolerating it. This is the single most important reason to use structured plans: it converts "the agent went off-script" from a vibe into a programmatic signal.

Exam tip: any option that retroactively edits the plan during execution is wrong. The plan is the contract; the execution log is separate.
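The block-and-log behaviour can be sketched as a guard that every tool call must pass before it runs. This is a minimal sketch under assumptions: `make_guard` and `PlanDivergence` are hypothetical names, and "divergence" is taken here to mean an unknown step ID, a different tool, or different arguments than the plan lists.

```python
import copy
import logging

logger = logging.getLogger("plan_guard")


class PlanDivergence(Exception):
    """Raised when execution attempts something the plan does not contain."""


def make_guard(plan: dict):
    # Snapshot the steps so later edits to the plan object cannot loosen the contract.
    planned = {s["id"]: s for s in copy.deepcopy(plan["steps"])}

    def guard(step_id: str, tool: str, args: dict) -> None:
        """Return silently if the call matches the plan; otherwise log and block."""
        step = planned.get(step_id)
        if step is None or step["tool"] != tool or step["args"] != args:
            logger.warning("blocked divergent call: step=%s tool=%s", step_id, tool)
            raise PlanDivergence(f"{step_id} / {tool} is not in the plan")

    return guard
```

The guard never modifies the plan. A blocked call surfaces as an exception that the caller can record in the execution log with a `policy_blocked` outcome.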

The step ledger

Each executed step writes a row to the ledger:

Field	Purpose
`step_id`	Joins back to the plan.
`started_at` / `finished_at`	Latency + ordering.
`outcome`	`success` / `validation_failed` / `tool_error` / `policy_blocked`.
`artifacts`	Diffs, file paths, response payloads (for inspection).

Together, the plan + ledger are the inspectable artifact a reviewer reads the next morning.
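A ledger row with the fields above can be sketched as an immutable record. The class name `LedgerRow`, the `latency_seconds` property, and the `unplanned_rows` helper are hypothetical; only the field names and the four outcome values come from the table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

Outcome = Literal["success", "validation_failed", "tool_error", "policy_blocked"]


@dataclass(frozen=True)  # frozen: rows are appended to the ledger, never edited
class LedgerRow:
    step_id: str                     # joins back to the plan
    started_at: datetime
    finished_at: datetime
    outcome: Outcome
    artifacts: tuple[str, ...] = ()  # e.g. diff or file paths, for inspection

    @property
    def latency_seconds(self) -> float:
        return (self.finished_at - self.started_at).total_seconds()


def unplanned_rows(plan: dict, ledger: list[LedgerRow]) -> list[LedgerRow]:
    """Return ledger rows whose step_id does not appear in the plan."""
    planned_ids = {step["id"] for step in plan["steps"]}
    return [row for row in ledger if row.step_id not in planned_ids]
```

Joining on `step_id` is what lets a reviewer line up each planned step with what actually happened and spot rows that have no planned counterpart.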

Where this shows up on the exam

Expect questions asking you to pick the right plan schema, and questions phrased as "the agent did X that wasn't in the plan — what should the runtime do?". Always: validate, block divergence, log it. Never: retroactively edit, trust the agent's confidence, or apologise to the user.

Anchor concepts

Key terms

Structured plan: A typed, ordered list of steps where each step names the tool to call, the arguments, and the precondition and expected postcondition.
Plan schema: The JSON / type definition the runtime uses to validate that the agent produced a well-formed plan before any step executes.
Plan validation: The set of checks applied to a structured plan: schema-valid, only allow-listed tools, arguments within policy, total cost / blast radius under budget.
Step ledger: The execution log paired with the plan: each step records started_at, finished_at, outcome, and any artifacts produced.

Watch out

Common pitfalls

Storing the plan as free-form markdown so reviewers can read it 'naturally' — then having no programmatic way to validate or replay it.
Letting the agent mutate the plan mid-execution without versioning. Now the post-mortem can't tell which plan was actually run.
Forgetting the precondition/postcondition fields. Without them you cannot detect drift between expected and actual state.
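One way to guard against the unversioned-mutation pitfall is to fingerprint the plan when it is approved and store that fingerprint with the execution log. This is only a sketch of the idea: the `plan_fingerprint` name and the choice of SHA-256 over canonical JSON are assumptions, not something GH-600 is stated to prescribe.

```python
import hashlib
import json


def plan_fingerprint(plan: dict) -> str:
    """Hash a canonical JSON form of the plan (sorted keys, fixed separators)."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the fingerprint recorded at approval differs from the one computed at post-mortem time, the plan was changed after it was approved.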