# Agent Evaluations

Agent Evaluations let you define quality criteria for your agent's outputs, configure automatic retries when criteria aren't met, and sample production runs for ongoing quality monitoring. Evaluation criteria are configured per-agent from the **Evaluations** tab on the agent detail page.

> **Agent evaluations are one of three safety layers** in Seclai. They validate output quality, while the [prompt scanner](https://seclai.com/docs/prompt-scanner) blocks injection attacks at ingress, and [governance policies](https://seclai.com/docs/governance) screen content against your compliance rules. See the [Safety & Quality Overview](https://seclai.com/docs/safety-overview) for a comparison of all three.

## Overview

**Example scenarios where agent evaluations help:**

- A customer-facing Q&A agent sometimes returns vague one-sentence answers when the user's question deserves a detailed explanation. An **eval-and-retry** criterion requiring "at least 3 actionable points" catches thin answers and automatically retries for a better response.
- Your data transformation agent occasionally produces malformed JSON that breaks downstream systems. An **eval-and-retry** criterion validating output structure catches the error before delivery and retries the step.
- A weekly report agent has been running reliably for months but you want to know if quality degrades over time. A **sample-and-flag** criterion scoring "factual accuracy and completeness" monitors 20% of runs without adding any latency.
- A multi-step research agent produces a final summary that occasionally contradicts facts stated in earlier steps. An **eval-and-retry** criterion checking for internal consistency catches contradictions and re-generates the summary.
- Your content generation agent sometimes drifts off-topic when given ambiguous prompts. A **sample-and-flag** criterion for "relevance to the original request" surfaces these drift cases for human review so you can improve the prompt.

Governance policies answer "is this content _safe and compliant_?" — agent evaluations answer "is this content _good enough_?" They are complementary: an output can pass all governance policies (no PII, no harmful content) but still be unhelpful, incomplete, or off-topic.

---

The evaluation system has two components:

1. **Evaluation Criteria** — Rules you define that specify what "good output" looks like for a step or the entire agent.
2. **Evaluation Results** — Records created each time a criterion is evaluated against an agent run, including pass/fail status, scores, and whether retries were triggered.

You can define multiple criteria per agent. Each criterion targets a specific terminal step by step ID.

*Figure: Agent evaluation runs as a parallel side branch from the terminal step, without blocking delivery.*

## Performance & Cost Impact

Each evaluation mode has different implications for agent latency and credit usage:

| Mode                   | Latency impact                                                           | Credit cost                 |
| ---------------------- | ------------------------------------------------------------------------ | --------------------------- |
| **Output Expectation** | None — manual only (results are recorded via API, not run automatically) | None                        |
| **Eval & Retry**       | Adds evaluation time per step, plus retry time when failures occur       | Per evaluation + per retry  |
| **Sample & Flag**      | **No latency impact** — sampled evaluations run in the background        | Per sampled evaluation only |

**Eval & Retry** is the only mode that directly affects agent execution time. When a step fails evaluation and triggers a retry, the step is re-executed and its new output is re-evaluated, so worst-case latency is `(1 + max_retries) × (step_time + evaluation_time)`. For most agents this tradeoff is worthwhile: catching a bad output and correcting it automatically is better than delivering it to users.
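As a rough illustration of that worst case, the sketch below assumes every attempt (the initial run plus each retry) is executed and then evaluated once. The function name and the timing values are hypothetical and chosen only for the arithmetic; they are not measured Seclai figures.

```python
def worst_case_latency(step_time_s: float, evaluation_time_s: float, max_retries: int) -> float:
    """Upper bound on latency for one step under Eval & Retry.

    Assumption: each attempt runs the step once and evaluates it once,
    and every attempt except possibly the last one fails evaluation.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    attempts = 1 + max_retries  # the initial run plus every retry
    return attempts * (step_time_s + evaluation_time_s)


# Illustrative numbers only: a 4 s step, a 1 s evaluation, 3 retries.
print(worst_case_latency(4.0, 1.0, 3))  # 20.0
```

Lowering **Max Retries** is the most direct way to cap this bound for latency-sensitive steps.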

**Sample & Flag** is designed for zero-latency monitoring. Evaluations run asynchronously after the agent completes, so your agents stay fast while you still get quality signals.

> **Tip:** Use **Eval & Retry** for steps where output quality is critical (customer-facing content, data transformations, API responses). Use **Sample & Flag** for ongoing monitoring of steps that are generally reliable but need periodic spot-checks.

## Criteria Types

Seclai supports three types of evaluation criteria, each serving a different purpose:

| Type                   | Purpose                                              | Runs On                     |
| ---------------------- | ---------------------------------------------------- | --------------------------- |
| **Output Expectation** | Validate output format, schema, or content patterns  | Every matching step output  |
| **Eval & Retry**       | Evaluate output and retry automatically on failure   | Every matching step output  |
| **Sample & Flag**      | Periodically evaluate and flag runs for human review | Sampled runs (configurable) |

### Output Expectations

Output Expectations validate that step outputs match expected formats, schemas, or content patterns. Use them to enforce structural requirements on your agent's outputs.

The **Expectation Config** is a JSON object that defines what to check:

```json
{
  "expected_format": "json",
  "contains": ["summary", "recommendation"],
  "custom_prompt": "The output must include a numbered list of at least 3 items."
}
```

| Field                        | Description                                                                                    |
| ---------------------------- | ---------------------------------------------------------------------------------------------- |
| <code>expected_format</code> | The expected output format (<code>json</code>, <code>text</code>, <code>markdown</code>, etc.) |
| <code>contains</code>        | Array of strings or keys that the output must contain                                          |
| <code>custom_prompt</code>   | A natural-language prompt describing additional requirements                                   |

Output expectations run on **every** matching step execution and produce a pass/fail result.
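To make the structural fields concrete, here is a minimal local sketch of the kind of check that an `expected_format` plus `contains` config describes. It is not Seclai's evaluator: the function name `check_expectation` is hypothetical, the interpretation of `contains` (top-level keys for JSON objects, substrings otherwise) is an assumption, and `custom_prompt` is ignored because it requires an LLM.

```python
import json


def check_expectation(output: str, config: dict) -> bool:
    """Return True if `output` satisfies the structural parts of `config`."""
    parsed = None
    if config.get("expected_format") == "json":
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False  # not valid JSON, so the format expectation fails
    # Other formats (text, markdown, ...) are not validated in this sketch.
    for item in config.get("contains", []):
        if isinstance(parsed, dict):
            # Assumption: for a JSON object, each entry must be a top-level key.
            if item not in parsed:
                return False
        elif item not in output:
            # Assumption: otherwise, each entry must appear as a substring.
            return False
    return True


config = {"expected_format": "json", "contains": ["summary", "recommendation"]}
print(check_expectation('{"summary": "ok", "recommendation": "ship"}', config))  # True
print(check_expectation('{"summary": "ok"}', config))  # False
```

A check like this can be handy in a test harness before you record a result through the API.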

### Eval & Retry

Eval & Retry criteria evaluate step outputs and automatically retry the step if evaluation fails. This is useful for steps where occasional failures are acceptable if a retry succeeds.

Key settings:

- **Max Retries** — Maximum number of retry attempts (1–10, default 3).
- **Retry on Failure** — Whether to actually trigger a retry, or just record the failure.

When an evaluation fails and retries are enabled, Seclai will:

1. Record the failed evaluation result
2. Re-execute the step with the same inputs
3. Re-evaluate the new output
4. Repeat until the evaluation passes or max retries is reached

Each retry is tracked in the evaluation results with an incrementing `retry_count`.
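The loop above can be sketched as follows. This illustrates the described behavior rather than Seclai's implementation: `execute_step` and `evaluate` are hypothetical callables standing in for the step and the evaluator, and the result dictionaries borrow only the `status`, `retry_triggered`, and `retry_count` field names from the evaluation result fields.

```python
from typing import Callable, Dict, List, Tuple


def run_with_eval_and_retry(
    execute_step: Callable[[], str],
    evaluate: Callable[[str], bool],
    max_retries: int = 3,
    retry_on_failure: bool = True,
) -> Tuple[str, List[Dict]]:
    """Run a step, evaluate it, and retry on failure up to max_retries times."""
    results: List[Dict] = []
    retry_count = 0
    while True:
        output = execute_step()    # first run, or a re-execution with the same inputs
        passed = evaluate(output)  # evaluate (or re-evaluate) the newest output
        will_retry = (not passed) and retry_on_failure and retry_count < max_retries
        results.append({
            "status": "passed" if passed else "failed",
            "retry_triggered": will_retry,
            "retry_count": retry_count,
        })
        if not will_retry:
            return output, results
        retry_count += 1


# Hypothetical step that produces a better answer on the third attempt.
attempts = iter(["too short", "still vague", "three actionable points"])
final, history = run_with_eval_and_retry(
    lambda: next(attempts),
    lambda out: "actionable" in out,
)
print(final)  # three actionable points
print([(r["status"], r["retry_count"]) for r in history])
# [('failed', 0), ('failed', 1), ('passed', 2)]
```

With `retry_on_failure=False` the loop records the single failed result and returns immediately, which matches the "just record the failure" setting.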

### Sample & Flag

Sample & Flag criteria typically don't run on every execution. Instead, they evaluate a subset of runs based on a **sample frequency** and flag runs that fail evaluation for human review.

This is ideal for production monitoring: you get quality signals without adding latency, because sampled evaluations run in the background after the agent completes.

Flagged runs appear in the evaluation results with `flagged: true` and can be filtered in both the UI and API.

## Configuring Evaluations

To configure evaluation criteria:

1. Navigate to your agent's detail page
2. Click the **Evaluations** tab
3. Click **Add Criteria**
4. Fill in the criteria details:
   - **Type** — Choose Output Expectation, Eval & Retry, or Sample & Flag
   - **Step ID** — Target the terminal step to evaluate
   - **Type-specific settings** — Configure the expectation config, retry settings, or sample frequency
   - **Evaluation Prompt** — (Optional) Custom prompt for LLM-based evaluation
5. Click **Create**

You can enable or disable a criterion at any time with the toggle on its card; disabling a criterion does not delete it.

## Evaluation Results

Each time a criterion is evaluated, an evaluation result is created. Results include:

| Field                        | Description                                                                                                 |
| ---------------------------- | ----------------------------------------------------------------------------------------------------------- |
| <code>status</code>          | <code>passed</code>, <code>failed</code>, <code>pending</code>, <code>skipped</code>, or <code>error</code> |
| <code>score</code>           | Numeric score (0.0–1.0) if applicable                                                                       |
| <code>details</code>         | JSON object with evaluation-specific details                                                                |
| <code>retry_triggered</code> | Whether this result caused a retry                                                                          |
| <code>retry_count</code>     | Number of retries attempted so far                                                                          |
| <code>flagged</code>         | Whether the run was flagged for review                                                                      |
| <code>evaluated_at</code>    | Timestamp of the evaluation                                                                                 |

Results are shown in a unified table below the step cards and can be filtered by status, step, and timeframe.

### Aggregated Results View

The results table is aggregated across criteria, so pagination and sorting are applied once across all matching rows. This makes it easier to triage failures by recency without jumping between criteria.
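One way to picture this is merging every criterion's rows into a single list before sorting and paginating. The sketch below assumes results are dictionaries with an ISO 8601 `evaluated_at` string in one uniform format, and it reuses the `page` and `limit` names from the API examples; the function and the sample rows are hypothetical.

```python
from typing import Dict, List


def aggregated_page(
    results_by_criterion: Dict[str, List[dict]],
    page: int = 1,
    limit: int = 20,
) -> List[dict]:
    """Merge all criteria's results, sort newest first, then paginate once."""
    merged = [row for rows in results_by_criterion.values() for row in rows]
    # ISO 8601 timestamps in a single format and time zone sort correctly as strings.
    merged.sort(key=lambda row: row["evaluated_at"], reverse=True)
    start = (page - 1) * limit
    return merged[start:start + limit]


# Made-up sample rows for two criteria.
rows = {
    "criterion_a": [{"status": "failed", "evaluated_at": "2024-05-01T10:00:00Z"}],
    "criterion_b": [
        {"status": "passed", "evaluated_at": "2024-05-02T09:30:00Z"},
        {"status": "failed", "evaluated_at": "2024-04-30T18:45:00Z"},
    ],
}
print([r["evaluated_at"] for r in aggregated_page(rows, page=1, limit=2)])
# ['2024-05-02T09:30:00Z', '2024-05-01T10:00:00Z']
```

Paginating per criterion instead would interleave old and new rows across pages, which is what the aggregated view avoids.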

## Filtering Results

The Evaluations tab provides multiple ways to filter and explore evaluation results.

### Time Frame Selection

Filter results by time using the time frame selector:

- **Relative** — Last N minutes, hours, days, weeks, or months
- **Absolute** — Specific start and end dates

### Sample Frequency

For Sample & Flag criteria, choose how often runs are evaluated:

| Frequency       | Description                                                      |
| --------------- | ---------------------------------------------------------------- |
| Every run       | Evaluate every single run (highest visibility, highest overhead) |
| Every 5th run   | Evaluate 1 in 5 runs                                             |
| Every 10th run  | Evaluate 1 in 10 runs                                            |
| Every 25th run  | Evaluate 1 in 25 runs                                            |
| Every 50th run  | Evaluate 1 in 50 runs                                            |
| Every 100th run | Evaluate 1 in 100 runs (lowest overhead)                         |

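Choose a frequency that balances visibility against evaluation cost for your use case. Because sampled evaluations run in the background, a higher frequency increases credit usage rather than run latency.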
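For intuition about what each frequency means in volume, the sketch below uses a simple run counter: a run is sampled when its number is a multiple of N. How Seclai actually selects which runs to sample is not stated here, so treat `should_sample` as a hypothetical illustration.

```python
def should_sample(run_number: int, every_n: int) -> bool:
    """Return True when a 1-based run number falls on the sampling interval."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return run_number % every_n == 0


# Out of 100 runs, "Every 25th run" evaluates four of them.
sampled = [n for n in range(1, 101) if should_sample(n, 25)]
print(sampled)  # [25, 50, 75, 100]

# "Every 5th run" corresponds to sampling 20% of runs.
print(sum(should_sample(n, 5) for n in range(1, 101)))  # 20
```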

### Evaluation Prompt

Use a custom **Evaluation Prompt** to instruct the LLM evaluator. This prompt is sent along with the step output to determine pass/fail:

```
Evaluate whether the output meets the following criteria:
1. Contains a clear summary paragraph
2. Includes at least 3 actionable recommendations
3. Uses professional tone throughout
4. Does not contain placeholders or TODO items

Return PASS if all criteria are met, FAIL otherwise.
```

## API Access

Evaluation criteria and results can be managed via the API using your API key.

### List Evaluation Criteria

```bash
AGENT_ID=...

curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
  -H "X-API-Key: $SECLAI_API_KEY"
```

### Create Evaluation Criteria

```bash
curl -X POST "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Output format check",
    "step_id": "step_1",
    "evaluation_mode": "output_expectation",
    "expectation_config": {
      "expected_format": "json",
      "contains": ["result", "status"]
    }
  }'
```

### Get Single Evaluation Criteria

```bash
CRITERIA_ID=...

curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY"
```

### Update Evaluation Criteria

```bash
curl -X PATCH "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": false
  }'
```

### Delete Evaluation Criteria

```bash
curl -X DELETE "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY"
```

### List Evaluation Results

```bash
curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results?status=failed&flagged_only=true&page=1&limit=20" \
  -H "X-API-Key: $SECLAI_API_KEY"
```
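If you script against this endpoint from Python, the same request can be built with the standard library. The sketch mirrors the curl call above (same URL, query parameters, and `X-API-Key` header); the helper name `build_results_request` is hypothetical, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.seclai.com"


def build_results_request(criteria_id, api_key, status="failed",
                          flagged_only=True, page=1, limit=20):
    """Build (but do not send) a GET request for a criterion's results."""
    query = urllib.parse.urlencode({
        "status": status,
        "flagged_only": "true" if flagged_only else "false",
        "page": page,
        "limit": limit,
    })
    # Quote the ID so unexpected characters cannot alter the URL path.
    safe_id = urllib.parse.quote(str(criteria_id), safe="")
    url = f"{API_BASE}/agents/evaluation-criteria/{safe_id}/results?{query}"
    return urllib.request.Request(url, headers={"X-API-Key": api_key}, method="GET")


request = build_results_request("CRITERIA_ID", "YOUR_API_KEY")
print(request.full_url)
```

To send it, wrap the call as `with urllib.request.urlopen(request) as response:` and read the body; the response format is not described in this section, so the sketch stops short of parsing it.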

### List Aggregated Agent Evaluation Results

```bash
curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-results?status=failed&step=step_1&page=1&limit=20" \
  -H "X-API-Key: $SECLAI_API_KEY"
```

### Create Evaluation Result

Record an evaluation result against a criterion (for example, from an external test harness or CI pipeline):

```bash
curl -X POST "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_run_id": "RUN_UUID",
    "status": "passed",
    "score": 0.95,
    "details": { "reason": "Output matched expected JSON schema" },
    "retry_triggered": false,
    "retry_count": 0,
    "flagged": false
  }'
```

### Get Run Evaluation Results

```bash
RUN_ID=...

curl "https://api.seclai.com/agents/$AGENT_ID/runs/$RUN_ID/evaluation-results" \
  -H "X-API-Key: $SECLAI_API_KEY"
```

### Get Evaluation Summary

```bash
curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/summary" \
  -H "X-API-Key: $SECLAI_API_KEY"
```

### Summary Metrics

For account-level monitoring, use the non-manual summary endpoint:

```bash
curl "https://api.seclai.com/agents/evaluation-results/non-manual-summary?days=30" \
  -H "X-API-Key: $SECLAI_API_KEY"
```

This summary covers only `eval_and_retry` and `sample_and_flag` criteria; `output_expectation` results, which are recorded manually, are excluded. In these summaries:

- `failed` includes both `failed` and `error` statuses.
- `flagged` counts results explicitly marked for review.
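The counting rules above amount to the following tally. This is a local sketch over result dictionaries with `status` and `flagged` fields; the function name and the shape of the returned dictionary are hypothetical and do not describe the endpoint's actual response.

```python
from typing import Dict, List


def tally(results: List[dict]) -> Dict[str, int]:
    """Count results the way the summary rules describe."""
    summary = {"passed": 0, "failed": 0, "flagged": 0}
    for result in results:
        status = result.get("status")
        if status == "passed":
            summary["passed"] += 1
        elif status in ("failed", "error"):
            # `error` results are folded into the failed count.
            summary["failed"] += 1
        if result.get("flagged"):
            summary["flagged"] += 1
    return summary


# Made-up sample results.
sample = [
    {"status": "passed", "flagged": False},
    {"status": "failed", "flagged": True},
    {"status": "error", "flagged": False},
    {"status": "skipped", "flagged": False},
]
print(tally(sample))  # {'passed': 1, 'failed': 2, 'flagged': 1}
```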

## MCP Tools

Evaluation criteria can also be managed through the MCP (Model Context Protocol) server. Connect your MCP client to your agent's MCP endpoint and use these tools:

| Tool                                           | Description                                                  |
| ---------------------------------------------- | ------------------------------------------------------------ |
| <code>list_evaluation_criteria</code>          | List all evaluation criteria for the agent                   |
| <code>create_evaluation_criteria</code>        | Create a new evaluation criterion                            |
| <code>update_evaluation_criteria</code>        | Update an existing criterion                                 |
| <code>get_evaluation_criteria</code>           | Fetch a single criterion by ID with result summary           |
| <code>delete_evaluation_criteria</code>        | Delete a criterion and all its results                       |
| <code>list_evaluation_results</code>           | List evaluation results for a criterion with filtering       |
| <code>create_evaluation_result</code>          | Record an evaluation result (status, score, details)         |
| <code>get_evaluation_summary</code>            | Get aggregated summary (pass/fail/error/flagged counts)      |
| <code>list_run_evaluation_results</code>       | List all evaluation results for a specific run               |
| <code>list_agent_evaluation_results</code>     | List aggregated evaluation results for an agent with filters |
| <code>get_non_manual_evaluation_summary</code> | Get account-level evaluation rollups                         |

**Example MCP tool call:**

```json
{
  "tool": "create_evaluation_criteria",
  "arguments": {
    "agent_id": "AGENT_UUID",
    "step_id": "step_1",
    "description": "JSON output validation",
    "evaluation_mode": "output_expectation",
    "expectation_config": {
      "expected_format": "json"
    },
    "evaluation_prompt": "Verify the output is valid JSON with a 'result' key"
  }
}
```

## Exporting

Export an agent's evaluation results as a downloadable file. Supported formats are **JSON**, **JSONL**, and **CSV**.

- **UI:** Open the agent's **Evaluations** tab and click the **Export** button. You'll see an estimate of the record count before confirming.
- **API:** `POST /authenticated/resource-exports` with `resource_type: "agent_evals"` and the agent ID.
- **MCP:** Use the `create_resource_export` tool with `resource_type: "agent_evals"`.

Exports include evaluation status, scores, details, retry information, and timestamps for each result.

See [Export Formats → Agent Evaluations](https://seclai.com/docs/export-formats#agent-evaluations) for the full file schema and available filter options.

## Next Steps

- [Agents](https://seclai.com/docs/agents) — Learn about creating and configuring agents
- [Agent Steps](https://seclai.com/docs/agent-steps) — Understand agent step definitions and execution
- [Alerts](https://seclai.com/docs/alerts) — Set up notifications for agent failures
- [API Introduction](https://seclai.com/docs/api-introduction) — Get started with the Seclai API
