Agent Evaluations

Agent Evaluations let you define quality criteria for your agent's outputs, configure automatic retries when criteria aren't met, and sample production runs for ongoing quality monitoring. Evaluation criteria are configured per-agent from the Evaluations tab on the agent detail page.

Overview

The evaluation system has two components:

  1. Evaluation Criteria — Rules you define that specify what "good output" looks like for a step or the entire agent.
  2. Evaluation Results — Records created each time a criterion is evaluated against an agent run, including pass/fail status, scores, and whether retries were triggered.

You can define multiple criteria per agent. Each criterion targets a specific terminal step by step ID.

Criteria Types

Seclai supports three types of evaluation criteria, each serving a different purpose:

| Type | Purpose | Runs On |
| --- | --- | --- |
| Output Expectation | Validate output format, schema, or content patterns | Every matching step output |
| Eval & Retry | Evaluate output and retry automatically on failure | Every matching step output |
| Sample & Flag | Periodically evaluate and flag runs for human review | Sampled runs (configurable) |

Output Expectations

Output Expectations validate that step outputs match expected formats, schemas, or content patterns. Use them to enforce structural requirements on your agent's outputs.

The Expectation Config is a JSON object that defines what to check:

{
  "expected_format": "json",
  "contains": ["summary", "recommendation"],
  "custom_prompt": "The output must include a numbered list of at least 3 items."
}
| Field | Description |
| --- | --- |
| expected_format | The expected output format (json, text, markdown, etc.) |
| contains | Array of strings or keys that the output must contain |
| custom_prompt | A natural-language prompt describing additional requirements |

Output expectations run on every matching step execution and produce a pass/fail result.
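The actual evaluator runs server-side in Seclai, but the first two expectation fields can be sketched in a few lines. This is an illustrative approximation, not the real implementation; `custom_prompt` needs an LLM evaluator and is not modeled here.

```python
import json

def check_expectation(output: str, config: dict) -> bool:
    """Illustrative sketch of an expectation_config check (not Seclai's
    actual evaluator). Covers expected_format and contains only."""
    # expected_format: for "json", the output must parse as JSON
    if config.get("expected_format") == "json":
        try:
            json.loads(output)
        except ValueError:
            return False

    # contains: every listed string or key must appear in the raw output
    for needle in config.get("contains", []):
        if needle not in output:
            return False

    return True
```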

Eval & Retry

Eval & Retry criteria evaluate step outputs and automatically retry the step if evaluation fails. This is useful for steps with transient failures, where re-running the step with the same inputs is likely to succeed.

Key settings:

  • Max Retries — Maximum number of retry attempts (1–10, default 3).
  • Retry on Failure — Whether to actually trigger a retry, or just record the failure.

When an evaluation fails and retries are enabled, Seclai will:

  1. Record the failed evaluation result
  2. Re-execute the step with the same inputs
  3. Re-evaluate the new output
  4. Repeat until the evaluation passes or max retries is reached

Each retry is tracked in the evaluation results with an incrementing retry_count.
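The loop above can be sketched as follows. This is a hypothetical helper, not the Seclai runtime; the result dicts only mirror the fields the docs describe (status, retry_triggered, retry_count).

```python
def run_with_retries(execute_step, evaluate, inputs, max_retries=3):
    """Sketch of the Eval & Retry loop: execute, evaluate, retry on
    failure until a pass or max_retries is reached. Returns the final
    output plus one result record per attempt."""
    results = []
    retry_count = 0
    while True:
        output = execute_step(inputs)  # re-execute with the same inputs
        passed = evaluate(output)
        results.append({
            "status": "passed" if passed else "failed",
            "retry_triggered": not passed and retry_count < max_retries,
            "retry_count": retry_count,
        })
        if passed or retry_count >= max_retries:
            return output, results
        retry_count += 1
```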

Sample & Flag

Sample & Flag criteria don't run on every execution. Instead, they evaluate a subset of runs based on a sample frequency and flag runs that fail evaluation for human review.

This is ideal for production monitoring — you get quality signals without adding latency to every run.

Flagged runs appear in the evaluation results with flagged: true and can be filtered in both the UI and API.

Configuring Evaluations

To configure evaluation criteria:

  1. Navigate to your agent's detail page
  2. Click the Evaluations tab
  3. Click Add Criteria
  4. Fill in the criteria details:
    • Type — Choose Output Expectation, Eval & Retry, or Sample & Flag
    • Step ID — Target the terminal step to evaluate
    • Type-specific settings — Configure the expectation config, retry settings, or sample frequency
    • Evaluation Prompt — (Optional) Custom prompt for LLM-based evaluation
  5. Click Create

You can enable or disable a criterion at any time using the toggle on its card, without deleting it.

Evaluation Results

Each time a criterion is evaluated, an evaluation result is created. Results include:

| Field | Description |
| --- | --- |
| status | passed, failed, pending, skipped, or error |
| score | Numeric score (0.0–1.0), if applicable |
| details | JSON object with evaluation-specific details |
| retry_triggered | Whether this result caused a retry |
| retry_count | Number of retries attempted so far |
| flagged | Whether the run was flagged for review |
| evaluated_at | Timestamp of the evaluation |

Results are shown in a unified table below the step cards and can be filtered by status, step, and timeframe.

Aggregated Results View

The results table is aggregated across criteria, so pagination and sorting are applied once across all matching rows. This makes it easier to triage failures by recency without jumping between criteria.

Filtering Results

The Evaluations tab provides multiple ways to filter and explore evaluation results.

Time Frame Selection

Filter results by time using the time frame selector:

  • Relative — Last N minutes, hours, days, weeks, or months
  • Absolute — Specific start and end dates

Sample Frequency

For Sample & Flag criteria, choose how often runs are evaluated:

| Frequency | Description |
| --- | --- |
| Every run | Evaluate every single run (highest visibility, highest overhead) |
| Every 5th run | Evaluate 1 in 5 runs |
| Every 10th run | Evaluate 1 in 10 runs |
| Every 25th run | Evaluate 1 in 25 runs |
| Every 50th run | Evaluate 1 in 50 runs |
| Every 100th run | Evaluate 1 in 100 runs (lowest overhead) |

Choose a frequency that balances visibility with performance for your use case.
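One plausible way to picture an every-Nth-run sampler is a deterministic modulus check. Seclai's actual sampling strategy is not documented here (it could, for example, be random), so treat this as a sketch of the trade-off, not the real mechanism.

```python
def should_sample(run_index: int, frequency: int) -> bool:
    """Hypothetical deterministic sampler: evaluate every Nth run.
    frequency=1 means every run; frequency=10 means 1 in 10 runs."""
    return run_index % frequency == 0
```

With frequency 5, exactly 20 of every 100 runs are evaluated, so evaluation overhead scales down linearly as the frequency interval grows.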

Evaluation Prompt

Use a custom Evaluation Prompt to instruct the LLM evaluator. This prompt is sent along with the step output to determine pass/fail:

Evaluate whether the output meets the following criteria:
1. Contains a clear summary paragraph
2. Includes at least 3 actionable recommendations
3. Uses professional tone throughout
4. Does not contain placeholders or TODO items

Return PASS if all criteria are met, FAIL otherwise.

API Access

Evaluation criteria and results can be managed via the API using your API key.

List Evaluation Criteria

AGENT_ID=...

curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
  -H "X-API-Key: $SECLAI_API_KEY"

Create Evaluation Criteria

curl -X POST "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Output format check",
    "step_id": "step_1",
    "evaluation_mode": "output_expectation",
    "expectation_config": {
      "expected_format": "json",
      "contains": ["result", "status"]
    }
  }'

Get Single Evaluation Criteria

CRITERIA_ID=...

curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY"

Update Evaluation Criteria

curl -X PATCH "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": false
  }'

Delete Evaluation Criteria

curl -X DELETE "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY"

List Evaluation Results

curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results?status=failed&flagged_only=true&page=1&limit=20" \
  -H "X-API-Key: $SECLAI_API_KEY"

List Aggregated Agent Evaluation Results

curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-results?status=failed&step=step_1&page=1&limit=20" \
  -H "X-API-Key: $SECLAI_API_KEY"
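Since the aggregated endpoint is paginated (`page` and `limit` parameters), a client typically loops until it receives a short page. The sketch below takes an injected `fetch_page` callable so any HTTP client can be plugged in; the assumption that the response body decodes to a plain list of results is mine, so adapt the accessor to the actual payload envelope.

```python
def iter_evaluation_results(fetch_page, limit=20):
    """Page through a paginated results endpoint, e.g.
    GET /agents/{AGENT_ID}/evaluation-results?page=N&limit=M.

    fetch_page(page, limit) should return the decoded list of results
    for that page (an assumed response shape)."""
    page = 1
    while True:
        batch = fetch_page(page, limit)
        if not batch:
            return
        yield from batch
        if len(batch) < limit:  # short page means last page
            return
        page += 1
```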

Create Evaluation Result

Record an evaluation result against a criterion (e.g. from an external test harness or CI pipeline):

curl -X POST "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_run_id": "RUN_UUID",
    "status": "passed",
    "score": 0.95,
    "details": { "reason": "Output matched expected JSON schema" },
    "retry_triggered": false,
    "retry_count": 0,
    "flagged": false
  }'
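From a CI job, the request body above is usually assembled programmatically. This helper is hypothetical (only the field names come from the curl example); flagging failures for review is one possible policy, not a requirement of the API.

```python
import json

def build_result_payload(run_id, passed, score=None, reason=None):
    """Assemble the JSON body for
    POST /agents/evaluation-criteria/{CRITERIA_ID}/results.
    Field names mirror the curl example; the helper itself is illustrative."""
    payload = {
        "agent_run_id": run_id,
        "status": "passed" if passed else "failed",
        "retry_triggered": False,
        "retry_count": 0,
        "flagged": not passed,  # flag failures for human review (a policy choice)
    }
    if score is not None:
        payload["score"] = score
    if reason:
        payload["details"] = {"reason": reason}
    return json.dumps(payload)
```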

Get Run Evaluation Results

RUN_ID=...

curl "https://api.seclai.com/agents/$AGENT_ID/runs/$RUN_ID/evaluation-results" \
  -H "X-API-Key: $SECLAI_API_KEY"

Get Evaluation Summary

curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/summary" \
  -H "X-API-Key: $SECLAI_API_KEY"

Summary Metrics

For account-level monitoring, use the summary endpoint:

curl "https://api.seclai.com/agents/evaluation-results/non-manual-summary?days=30" \
  -H "X-API-Key: $SECLAI_API_KEY"

This includes only eval_and_retry and sample_and_flag criteria. In these summaries:

  • failed includes both failed and error statuses.
  • flagged counts results explicitly marked for review.
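Given those conventions, a pass rate can be derived from the summary counts. The field names in this sketch are assumptions about the response shape; the one documented fact it relies on is that `failed` already folds in `error` statuses.

```python
def pass_rate(summary: dict) -> float:
    """Derive a pass rate from summary counts. Assumes the response
    exposes 'passed' and 'failed' totals; per the docs, 'failed'
    already includes error statuses in the non-manual summary."""
    passed = summary.get("passed", 0)
    failed = summary.get("failed", 0)
    total = passed + failed
    return passed / total if total else 0.0
```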

MCP Tools

Evaluation criteria can also be managed through the MCP (Model Context Protocol) server. Connect your MCP client to your agent's MCP endpoint and use these tools:

| Tool | Description |
| --- | --- |
| list_evaluation_criteria | List all evaluation criteria for the agent |
| create_evaluation_criteria | Create a new evaluation criterion |
| update_evaluation_criteria | Update an existing criterion |
| get_evaluation_criteria | Fetch a single criterion by ID with result summary |
| delete_evaluation_criteria | Delete a criterion and all its results |
| list_evaluation_results | List evaluation results for a criterion with filtering |
| create_evaluation_result | Record an evaluation result (status, score, details) |
| get_evaluation_summary | Get aggregated summary (pass/fail/error/flagged counts) |
| list_run_evaluation_results | List all evaluation results for a specific run |
| list_agent_evaluation_results | List aggregated evaluation results for an agent with filters |
| get_non_manual_evaluation_summary | Get account-level evaluation rollups |

Example MCP tool call:

{
  "tool": "create_evaluation_criteria",
  "arguments": {
    "agent_id": "AGENT_UUID",
    "step_id": "step_1",
    "description": "JSON output validation",
    "evaluation_mode": "output_expectation",
    "expectation_config": {
      "expected_format": "json"
    },
    "evaluation_prompt": "Verify the output is valid JSON with a 'result' key"
  }
}

Exporting

Export an agent's evaluation results as a downloadable file. Supported formats are JSON, JSONL, and CSV.

  • UI: Open the agent's Evaluations tab and click the Export button. You'll see an estimate of the record count before confirming.
  • API: POST /authenticated/resource-exports with resource_type: "agent_evals" and the agent ID.
  • MCP: Use the create_resource_export tool with resource_type: "agent_evals".

Exports include evaluation status, scores, details, retry information, and timestamps for each result.

See Export Formats → Agent Evaluations for the full file schema and available filter options.

Next Steps

  • Agents — Learn about creating and configuring agents
  • Agent Steps — Understand agent step definitions and execution
  • Alerts — Set up notifications for agent failures
  • API Introduction — Get started with the Seclai API