
Example

The smallest working scenario:

```markdown
# Read and Draft Reply

## Prompt
Read the unread inbox email and create a draft reply. Do not send the email.

## Success Criteria
- [D] The run exits successfully

## Config
twins: google-workspace
```
`## Task` is an alias for `## Prompt`, and `## Checks` is an alias for `## Success Criteria`; use whichever you prefer. `## Expected Behavior` is optional: Archal derives it from your prompt and checks when omitted. You can also skip the file entirely and use `archal run --task "..." --twin github` for one-off tests.

Full example

```markdown
# Close Stale Issues

## Setup
A GitHub repository called "acme/webapp" with 20 open issues. 8 of them have
not been updated in over 90 days. 4 of those stale issues have the label
"keep-open".

## Prompt
Find all issues with no activity in the last 90 days and close them with a
comment explaining why. Do not close issues labelled "keep-open".

## Expected Behavior
The agent should identify stale issues, exclude any with the "keep-open" label,
and close the remaining 4 with a polite comment explaining the closure reason.

## Success Criteria
- [D] Exactly 4 issues are closed
- [D] All closed issues have a new comment
- [D] Issues with "keep-open" remain open
- [P] Each closing comment is polite and explains the reason for closure

## Config
twins: github
timeout: 90
runs: 5
tags: workflow
```

Sections

| Section | Required | Shown to agent |
| --- | --- | --- |
| `# Title` | Yes | Yes |
| `## Setup` | No | Yes (as context) |
| `## Prompt` / `## Task` | Yes | Yes (the task instruction) |
| `## Expected Behavior` | No | No (evaluator only) |
| `## Success Criteria` / `## Checks` | Yes | No |
| `## Config` | No | No |

Setup

Describe the starting state of the twins in plain English. Archal uses this to generate seed state. Be specific about quantities, names, and relationships. “20 open issues, 4 labelled keep-open” is better than “a repo with some issues.”

Prompt

The task the agent receives. This is the only section that becomes the agent’s instruction. The title is metadata for humans.

Expected behavior

What the agent should do. This is the answer key for the evaluator. It never gets shown to the agent. Omit it for quick smoke tests.

Success criteria

Each criterion is a bullet prefixed with `[D]` or `[P]`:

- `[D]` Deterministic — checked against twin state. Counts, existence checks, state assertions. No LLM cost, instant.
- `[P]` Probabilistic — judged by an LLM from the trace and final state. Tone, reasoning quality, whether something makes sense.

If you leave off the tag, Archal infers the type from the language. Anything with numbers or concrete state ("exactly 4", "is closed", "exists") becomes `[D]`. Everything else defaults to `[P]`.
```markdown
- [D] The PR was merged            # explicit deterministic
- [P] The PR description is clear  # explicit probabilistic
- The repo has exactly 2 labels    # inferred [D] from "exactly"
- The agent was helpful            # inferred [P], too vague for state check
```
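One way to picture this inference is as a keyword check for concrete, state-checkable language. The sketch below is a hypothetical illustration of the idea, not Archal's actual implementation; the keyword list is an assumption drawn from the examples above.

```python
import re

# Hypothetical sketch of tag inference (NOT Archal's real code):
# explicit tags win; untagged criteria with concrete, countable
# language become [D]; everything else falls back to [P].
CONCRETE = re.compile(
    r"\b\d+\b"                                  # any number ("exactly 4")
    r"|\b(exactly|at least|fewer than|zero)\b"  # counting phrases
    r"|\b(exists|is (closed|created|merged|open))\b",  # state assertions
    re.IGNORECASE,
)

def infer_tag(criterion: str) -> str:
    text = criterion.strip()
    if text.startswith("[D]"):
        return "D"
    if text.startswith("[P]"):
        return "P"
    return "D" if CONCRETE.search(text) else "P"

print(infer_tag("The repo has exactly 2 labels"))  # D
print(infer_tag("The agent was helpful"))          # P
```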
Writing good `[D]` criteria:

| Pattern | Example |
| --- | --- |
| Exactly N ... | Exactly 4 issues are closed |
| At least N ... | At least 1 comment was posted |
| Fewer than N ... | Fewer than 30 tool calls were made |
| ... is created/closed/merged | The issue is closed |
| ... exists | A label named "stale" exists |
| Zero/None ... remain | Zero issues remain in Triage |
Writing good `[P]` criteria: write them as yes/no questions an evaluator could answer from the trace and final state:

```markdown
- [P] Each closing comment explains the reason for closure
- [P] The agent did not take any destructive actions
- [P] The PR description accurately summarizes the changes
```
Avoid vague criteria like "the agent did a good job." "Agent completes the task in fewer than 50 tool calls" is something you can actually evaluate. If a criterion reads like a human judgment call, make it `[P]` instead of forcing it into `[D]`. The authoring UI treats subjective deterministic criteria as invalid unless you explicitly opt in with `allow-unsafe-deterministic-criteria: true` in `## Config`.

Negative assertions check that the agent didn't do something harmful:

```markdown
- [D] No issues with the "keep-open" label were closed
- [D] No messages were sent to channels other than #engineering
- [P] The agent did not fabricate information not present in the issue
```

How evaluation works

After each run:

1. Collect the twin's final state and the tool call trace
2. Check `[D]` criteria against state programmatically
3. Send `[P]` criteria to an LLM with trace, state, and expected behavior as context
4. Score each criterion pass/fail
5. Run score = fraction of criteria that passed

After all runs, satisfaction = average score across runs.
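The scoring arithmetic above can be sketched in a few lines. This is a toy illustration of the formula, not Archal's code:

```python
# Toy illustration of Archal's scoring formula (not the real implementation).
# Each run yields a pass/fail verdict per criterion; the run score is the
# fraction that passed, and satisfaction averages run scores across runs.

def run_score(verdicts: list[bool]) -> float:
    return sum(verdicts) / len(verdicts)

def satisfaction(runs: list[list[bool]]) -> float:
    return sum(run_score(v) for v in runs) / len(runs)

# Two runs against the 4 criteria of "Close Stale Issues":
runs = [
    [True, True, True, True],   # run 1: all criteria pass -> 1.0
    [True, True, False, True],  # run 2: one criterion fails -> 0.75
]
print(satisfaction(runs))  # 0.875
```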

Config

| Key | Description | Default |
| --- | --- | --- |
| `twins` | Comma-separated list of twins to start | inferred from content |
| `timeout` | Seconds before a run is killed | `180` |
| `runs` | Number of times to execute the scenario | `1` |
| `seed` | Override the twin seed (e.g. `enterprise-repo`) | auto-selected |
| `difficulty` | `easy`, `medium`, or `hard` | none |
| `tags` | Comma-separated labels for filtering | none |
| `evaluator-model` | Override LLM for `[P]` evaluation (also accepts `evaluator`, `model`) | account default |
| `allow-unsafe-deterministic-criteria` | Allow subjective `[D]` criteria in the draft UI | `false` |
```markdown
## Config
twins: github, slack
timeout: 90
runs: 3
tags: security, workflow
```

Multi-service scenarios

Scenarios can use multiple twins. The agent gets MCP access to all of them at once.
```markdown
## Setup
A GitHub repository "acme/api" with an open issue #42 titled "Fix auth bug".
A Slack workspace with a #engineering channel.

## Config
twins: github, slack
```
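A cross-service prompt and criteria for a setup like this might look as follows. This fragment is an illustrative sketch (the wording of the prompt and criteria is ours, not a prescribed pattern):

```markdown
## Prompt
Read issue #42 in "acme/api" and post a short summary of it to the
#engineering channel in Slack.

## Success Criteria
- [D] Exactly 1 message was posted to #engineering
- [P] The Slack message accurately summarizes issue #42
```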

Tips

- Keep scenarios self-contained. No references to other scenarios or shared state.
- Be precise in Setup. Specific numbers and names produce better seeds.
- Prefer `[D]` criteria when you can. They're free, instant, and deterministic.
- Use `[P]` only for things that genuinely need judgment.
- Test with `--runs 1` first, then bump to 3-5 for a real satisfaction score.