Skip to main content

Example

The smallest working scenario:
# Create an Issue

## Prompt
Create a GitHub issue titled "hello world".

## Success Criteria
- [D] An issue titled "hello world" exists

## Config
clones: github
You need only three sections to start: Prompt, Success Criteria, and Config. Add Setup when the clone needs specific starting state. You can also skip the file entirely and use an inline task for one-off tests, but --task still needs a runnable agent path:
archal run --task "..." --harness . --docker --clone github

Larger example

# Close Stale Issues

## Setup
A GitHub repository called "acme/webapp" with 20 open issues.
8 have no activity in 90 days. 4 of those have the label "keep-open".

## Prompt
Find all issues with no activity in the last 90 days and close them with a
comment explaining why. Do not close issues labelled "keep-open".

## Success Criteria
- [D] Exactly 4 issues are closed
- [D] All closed issues have a new comment
- [D] Issues with "keep-open" remain open
- [P] Each closing comment is polite and explains the reason for closure

## Config
clones: github
timeout: 90
runs: 5
tags: workflow

Sections

SectionRequiredShown to agent
# TitleYesYes
## SetupNoNo
## Prompt / ## TaskYesYes (the task instruction)
## Expected BehaviorNoNo (evaluator only)
## Success Criteria / ## ChecksYesNo
## ConfigNoNo

Setup

Describe the starting state in plain English. Use quantities, names, and relationships: “20 open issues, 4 labelled keep-open” is better than “a repo with some issues.”

Prompt

The task the agent receives. This is the only section that becomes the agent’s instruction. The title is metadata for humans.

Expected behavior

What the agent should do. This is the answer key for the evaluator. It never gets shown to the agent. Omit it for quick smoke tests.

Success criteria

Each criterion is a bullet prefixed with [D] or [P]:
  • [D] Deterministic — checked against clone state. Counts, existence checks, state assertions. No LLM cost, instant.
  • [P] Probabilistic — judged by an LLM from the trace and final state. Tone, reasoning quality, whether something makes sense.
Use [D] for things a database-style state check can answer. Use [P] for judgment calls. Good [D] criteria are concrete:
  • Exactly 4 issues are closed
  • A label named "stale" exists
  • Zero messages were sent to channels other than #engineering
Good [P] criteria are judgment calls the evaluator can answer from the trace and final state:
- [P] Each closing comment explains the reason for closure
- [P] The agent did not take any destructive actions
- [P] The PR description accurately summarizes the changes
Avoid vague criteria like “the agent did a good job.”

How evaluation works

[D] criteria are checked against clone state. [P] criteria go to an LLM with the trace, final state, and expected behavior as context. Satisfaction is the average run score across runs.

Advanced criteria behavior

If you leave off [D] or [P], Archal only infers deterministic criteria from narrow phrasings such as exactly, at least, exists, created, closed, and merged. Subjective criteria stay probabilistic unless you tag them.
- [D] The PR was merged
- [P] The PR description is clear
- The repo has exactly 2 labels    # inferred [D] from "exactly"
If a criterion reads like a human judgment call, make it [P].

Config

KeyDescriptionDefault
clonesComma-separated list of clones to startinferred from content
timeoutSeconds before a run is killed180
runsNumber of times to execute the scenario1
seedOverride the clone seed (e.g. enterprise-repo)auto-selected
tagsComma-separated labels for filteringnone
evaluator-modelOverride LLM for [P] evaluation (also accepts evaluator, model)account default
## Config
clones: github, slack
timeout: 90
runs: 3
tags: security, workflow

Multi-service scenarios

Scenarios can use multiple clones. The agent gets access to all of them at once.
## Setup
A GitHub repository "acme/api" with an open issue #42 titled "Fix auth bug".
A Slack workspace with a #engineering channel.

## Config
clones: github, slack

Tips

  • Keep scenarios self-contained. No references to other scenarios or shared state.
  • Be precise in Setup. Specific numbers and names work better.
  • Prefer [D] criteria when you can. They’re free, instant, and deterministic.
  • Use [P] only for things that genuinely need judgment.
  • Test with --runs 1 first, then bump to 3-5 for a real satisfaction score.