## Example
The smallest working scenario:

`## Task` is an alias for `## Prompt`, and `## Checks` is an alias for `## Success Criteria`. Use whichever you prefer.
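A minimal file might look like this (the title, prompt, and check are hypothetical placeholders, not from the source):

```markdown
# Close stale issues

## Prompt
Close every open issue that has had no activity for 90 days.

## Checks
- [D] Zero issues older than 90 days remain open
```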
`## Expected Behavior` is optional. Archal derives it from your prompt and checks when omitted.

You can also skip the file entirely and use `archal run --task "..." --twin github` for one-off tests.
## Full example
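A fuller scenario using every section might look like the following sketch. All names, counts, and config values here are hypothetical, and the `## Config` syntax assumes simple `key: value` lines:

```markdown
# Triage stale issues

## Setup
A GitHub repo with 20 open issues. 12 have had no activity for 90+ days;
4 of those are labelled "keep-open".

## Prompt
Close stale issues (no activity for 90 days), but leave anything labelled
"keep-open" untouched. Post a polite closing comment on each issue you close.

## Expected Behavior
The agent closes the 8 stale issues that lack the "keep-open" label,
comments on each, and leaves every other issue alone.

## Success Criteria
- [D] Exactly 8 issues are closed
- [D] At least 8 comments were posted
- [D] Zero issues labelled "keep-open" were closed
- [P] The closing comments are polite and explain why the issue was closed

## Config
twins: github
timeout: 180
```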
## Sections
| Section | Required | Shown to agent |
|---|---|---|
| `# Title` | Yes | Yes |
| `## Setup` | No | Yes (as context) |
| `## Prompt` / `## Task` | Yes | Yes (the task instruction) |
| `## Expected Behavior` | No | No (evaluator only) |
| `## Success Criteria` / `## Checks` | Yes | No |
| `## Config` | No | No |
### Setup
Describe the starting state of the twins in plain English. Archal uses this to generate seed state. Be specific about quantities, names, and relationships: “20 open issues, 4 labelled keep-open” is better than “a repo with some issues.”

### Prompt
The task the agent receives. This is the only section that becomes the agent’s instruction; the title is metadata for humans.

### Expected behavior
What the agent should do. This is the answer key for the evaluator; it is never shown to the agent. Omit it for quick smoke tests.

### Success criteria
Each criterion is a bullet prefixed with `[D]` or `[P]`:

- `[D]` Deterministic — checked against twin state: counts, existence checks, state assertions. No LLM cost, instant.
- `[P]` Probabilistic — judged by an LLM from the trace and final state: tone, reasoning quality, whether something makes sense.

Criteria matching the patterns below are classified as `[D]`. Everything else defaults to `[P]`.
| Pattern | Example |
|---|---|
| `Exactly N ...` | Exactly 4 issues are closed |
| `At least N ...` | At least 1 comment was posted |
| `Fewer than N ...` | Fewer than 30 tool calls were made |
| `... is created/closed/merged` | The issue is closed |
| `... exists` | A label named "stale" exists |
| `Zero/None ... remain` | Zero issues remain in Triage |
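For instance, a hypothetical `## Success Criteria` block built from these patterns:

```markdown
## Success Criteria
- [D] Exactly 4 issues are closed
- [D] A label named "stale" exists
- [D] Fewer than 30 tool calls were made
- [P] The closing comments are polite and give a clear reason
```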
If a criterion genuinely needs judgment, mark it `[P]` instead of forcing it into `[D]`. The authoring UI treats subjective deterministic criteria as invalid unless you explicitly opt in with `allow-unsafe-deterministic-criteria: true` in `## Config`.
Negative assertions check the agent didn’t do something harmful:
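A couple of hypothetical examples:

```markdown
- [D] Zero issues labelled "keep-open" were closed
- [P] The agent did not post dismissive or hostile comments
```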
## How evaluation works
After each run:

- Collect the twin’s final state and the tool call trace
- Check `[D]` criteria against state programmatically
- Send `[P]` criteria to an LLM with trace, state, and expected behavior as context
- Score each criterion pass/fail
- Run score = fraction of criteria that passed
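The scoring loop above can be sketched in Python. This is a simplified model, not Archal's implementation; `Criterion`, `check_state`, and `ask_llm` are hypothetical names standing in for the twin-state checker and the LLM judge:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    kind: str           # "D" (deterministic) or "P" (probabilistic)
    text: str
    passed: bool = False

def score_run(criteria: list[Criterion],
              check_state: Callable[[str], bool],
              ask_llm: Callable[[str], bool]) -> float:
    """Score one run: [D] criteria go to the state checker, [P] to the judge."""
    for c in criteria:
        c.passed = check_state(c.text) if c.kind == "D" else ask_llm(c.text)
    # Run score = fraction of criteria that passed
    return sum(c.passed for c in criteria) / len(criteria)
```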
### Config
| Key | Description | Default |
|---|---|---|
| `twins` | Comma-separated list of twins to start | inferred from content |
| `timeout` | Seconds before a run is killed | `180` |
| `runs` | Number of times to execute the scenario | `1` |
| `seed` | Override the twin seed (e.g. `enterprise-repo`) | auto-selected |
| `difficulty` | `easy`, `medium`, or `hard` | none |
| `tags` | Comma-separated labels for filtering | none |
| `evaluator-model` | Override the LLM for `[P]` evaluation (also accepts `evaluator`, `model`) | account default |
| `allow-unsafe-deterministic-criteria` | Allow subjective `[D]` criteria in the draft UI | `false` |
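Put together, a hypothetical `## Config` block (assuming `key: value` lines, which the key names suggest but which is not spelled out here; all values are illustrative):

```markdown
## Config
twins: github, slack
timeout: 300
runs: 3
difficulty: medium
tags: triage, multi-service
```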
## Multi-service scenarios
Scenarios can use multiple twins. The agent gets MCP access to all of them at once.

## Tips
- Keep scenarios self-contained. No references to other scenarios or shared state.
- Be precise in Setup. Specific numbers and names produce better seeds.
- Prefer `[D]` criteria when you can. They’re free, instant, and deterministic.
- Use `[P]` only for things that genuinely need judgment.
- Test with `--runs 1` first, then bump to 3-5 for a real satisfaction score.