Example
The smallest working scenario contains a Prompt, Success Criteria, and
Config. Add Setup when the clone needs specific starting state.
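A minimal scenario file could look like the sketch below. The section headings follow the Sections table; the `key: value` layout inside Config, the `github` clone name, and the task itself are assumptions made for illustration.

```markdown
# Create a stale label

## Prompt
Create a label named "stale" in the repository.

## Success Criteria
- [D] A label named "stale" exists

## Config
clones: github
timeout: 180
runs: 1
```

This task starts from an empty state, so it has no Setup section.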
You can also skip the file entirely and use an inline task for one-off tests,
but --task still needs a runnable agent path:
Larger example
Sections
| Section | Required | Shown to agent |
|---|---|---|
| # Title | Yes | Yes |
| ## Setup | No | No |
| ## Prompt / ## Task | Yes | Yes (the task instruction) |
| ## Expected Behavior | No | No (evaluator only) |
| ## Success Criteria / ## Checks | Yes | No |
| ## Config | No | No |
Setup
Describe the starting state in plain English. Use quantities, names, and relationships: “20 open issues, 4 labelled keep-open” is better than “a repo with some issues.”
Prompt
The task the agent receives. This is the only section that becomes the agent’s instruction. The title is metadata for humans.
Expected behavior
What the agent should do. This is the answer key for the evaluator. It never gets shown to the agent. Omit it for quick smoke tests.
Success criteria
Each criterion is a bullet prefixed with [D] or [P]:
- [D] Deterministic: checked against clone state. Counts, existence checks, state assertions. No LLM cost, instant.
- [P] Probabilistic: judged by an LLM from the trace and final state. Tone, reasoning quality, whether something makes sense.

Use [D] for things a database-style state check can answer. Use [P] for
judgment calls.
Good [D] criteria are concrete:
- Exactly 4 issues are closed
- A label named "stale" exists
- Zero messages were sent to channels other than #engineering
[P] criteria are judgment calls the evaluator can answer from the trace
and final state:
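For instance, hypothetical [P] criteria for an issue-triage scenario could read as follows. The wording is illustrative and not taken from a shipped scenario.

```markdown
## Success Criteria
- [P] The closing comment on each issue explains why it was closed
- [P] The comments the agent wrote are polite and professional
```

Neither of these can be settled by a count or an existence check, which is why they are tagged [P].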
How evaluation works
[D] criteria are checked against clone state. [P] criteria go to an LLM with
the trace, final state, and expected behavior as context. Satisfaction is the
average run score across runs.
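As a worked illustration, assume each run $i$ produces a score $s_i$ between 0 and 1. How a single run score is derived from its criteria is not spelled out here, and the three scores below are hypothetical.

$$
\text{satisfaction} = \frac{1}{N}\sum_{i=1}^{N} s_i = \frac{1.0 + 0.5 + 0.75}{3} = 0.75
$$

With a single run, satisfaction is that run's score, so one unlucky run can swing the result.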
Advanced criteria behavior
If you leave off [D] or [P], Archal only infers deterministic criteria from
narrow phrasings such as exactly, at least, exists, created, closed,
and merged. Subjective criteria stay probabilistic unless you tag them.
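The sketch below shows how untagged criteria would be classified under the rule described above. The criteria are hypothetical, and the exact matching logic is not specified here, so treat the expected classifications as an assumption.

```markdown
## Success Criteria
- Exactly 4 issues are closed
- A label named "stale" exists
- The closing comments are polite
- [D] No issue labelled keep-open was modified
```

The first two bullets use the phrasings exactly, closed, and exists, so they would be inferred as deterministic. The third is subjective and would stay probabilistic. The fourth carries an explicit tag, so no inference applies to it.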
Config
| Key | Description | Default |
|---|---|---|
clones | Comma-separated list of clones to start | inferred from content |
timeout | Seconds before a run is killed | 180 |
runs | Number of times to execute the scenario | 1 |
seed | Override the clone seed (e.g. enterprise-repo) | auto-selected |
tags | Comma-separated labels for filtering | none |
evaluator-model | Override LLM for [P] evaluation (also accepts evaluator, model) | account default |
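A Config section using several of these keys might look like the sketch below. The `key: value` layout, the clone names `github` and `slack`, and the tag values are assumptions. The seed name `enterprise-repo` comes from the table above.

```markdown
## Config
clones: github, slack
timeout: 300
runs: 3
seed: enterprise-repo
tags: triage, smoke
```

Any key left out falls back to the default listed in the table.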
Multi-service scenarios
Scenarios can use multiple clones. The agent gets access to all of them at once.
Tips
- Keep scenarios self-contained. No references to other scenarios or shared state.
- Be precise in Setup. Specific numbers and names work better.
- Prefer [D] criteria when you can. They’re free, instant, and deterministic.
- Use [P] only for things that genuinely need judgment.
- Test with --runs 1 first, then bump to 3-5 for a real satisfaction score.
