8 min readmedium+40 XP

Define Inputs, Outputs and Success Criteria

Before you write a single prompt, you need a typed contract: what comes in, what goes out, and how you decide the run was a success. This topic gives you the framework GH-600 expects you to apply.

After this topic, you'll be confident about Input contract, Output contract, Success criteria and 1 more concept.

Define Inputs, Outputs and Success Criteria

An agent without a typed contract is a glorified autocomplete. The exam frames the architecture work as three artifacts you produce before writing prompts:

The input contract — what the agent is allowed to start with, with provenance.
The output contract — what shape the agent's final artifact must take.
The success criteria — programmatic checks that decide whether the run met the goal.

Inputs: type and trust

Every input the agent sees has two attributes the exam expects you to model explicitly:

Attribute	Why it matters
Type (string / file / tool-result / etc.)	Lets the runtime validate before the model ever sees it.
Provenance (user / retrieved / tool / system)	Lets the agent treat retrieved-content instructions as data, not commands — the basis of XPIA mitigation.

A search snippet is data, not an instruction, even if it contains the word "ignore previous instructions". Provenance is how the agent knows.

Outputs: a schema, not a vibe

The output contract is a typed schema — JSON, a function signature, a PR template. The point isn't formatting; it is that downstream code can:

Validate the output mechanically.
Reject malformed runs with a precise reason ("missing priority field"), not a shrug ("the model hallucinated").
Diff outputs across runs to spot regressions during evaluation.

// Example output contract for a triage agent
{
  "issue_id": "string",
  "priority": "low | medium | high | security",
  "rationale": "string (≤ 200 chars)",
  "suggested_assignee": "string | null",
  "needs_human_review": "boolean"
}

Success: programmatic, not vibes

A success criterion that you can't run in CI isn't a success criterion. The exam-grade pattern combines three layers:

Schema validity — does the output parse?
Task-specific assertions — did the triage label match the held-out set? Is the security constraint preserved?
Offline eval — aggregate accuracy on a frozen evaluation set, with a threshold the team agreed to.

Exam tip: if an option says "users haven't complained" or "the agent finished quickly", it's a distractor. The right answer always names a held-out set or an absolute constraint.

Quick check

1 of 3

+40 XP

Which of these is the **strongest** success criterion for an agent that triages incoming bug reports?

Pick your answer.

Where this shows up on the exam

Domain 1 and Domain 4 both lean on this skill. If you can read a scenario and immediately sketch the input schema, output schema, and the two-or-three programmatic checks you'd add to CI, you'll answer most "which option is the best success criterion?" questions correctly.

Anchor concepts

Key terms

Input contract: A typed schema describing every piece of data the agent is allowed to start with — including provenance and trust level.
Output contract: A typed schema for the agent's final artifact (e.g. a JSON object, a diff, a structured plan). Anything outside the schema is rejected.
Success criteria: Programmatic checks (not vibes) that decide whether a run met the goal. Usually a combination of schema-validity, task-specific assertions, and an offline eval.
Acceptance test: A specific, executable check derived from the success criteria — the agent's PR doesn't merge unless it passes.

Watch out

Common pitfalls

Defining success as 'the user didn't complain' — silent failures rot the agent's evaluation set and you'll never catch regressions.
Letting the output be free-form text when downstream code expects fields. Parse failures will be reported as 'the model hallucinated', when really the output contract was missing.
Treating provenance ('this came from the user' vs 'this came from a tool') as the same trust level — the basis of most prompt-injection incidents.