Agent Evaluations

Agent Evaluations let you define quality criteria for your agent's outputs, configure automatic retries when criteria aren't met, and sample production runs for ongoing quality monitoring. Evaluation criteria are configured per-agent from the Evaluations tab on the agent detail page.

Overview

The evaluation system has two components:

  1. Evaluation Criteria — Rules you define that specify what "good output" looks like for a step or the entire agent.
  2. Evaluation Results — Records created each time a criterion is evaluated against an agent run, including pass/fail status, scores, and whether retries were triggered.

You can define multiple criteria per agent. Each criterion targets a specific terminal step by step ID.

Criteria Types

Seclai supports three types of evaluation criteria, each serving a different purpose:

| Type | Purpose | Runs On |
| --- | --- | --- |
| Output Expectation | Validate output format, schema, or content patterns | Every matching step output |
| Eval & Retry | Evaluate output and retry automatically on failure | Every matching step output |
| Sample & Flag | Periodically evaluate and flag runs for human review | Sampled runs (configurable) |

Output Expectations

Output Expectations validate that step outputs match expected formats, schemas, or content patterns. Use them to enforce structural requirements on your agent's outputs.

The Expectation Config is a JSON object that defines what to check:

{
  "expected_format": "json",
  "contains": ["summary", "recommendation"],
  "custom_prompt": "The output must include a numbered list of at least 3 items."
}
| Field | Description |
| --- | --- |
| expected_format | The expected output format (json, text, markdown, etc.) |
| contains | Array of strings or keys that the output must contain |
| custom_prompt | A natural-language prompt describing additional requirements |

Output expectations run on every matching step execution and produce a pass/fail result.
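The actual evaluator runs server-side in Seclai, but the first two expectation fields can be sketched in a few lines. This is an illustrative approximation, not the real implementation; `custom_prompt` needs an LLM evaluator and is not modeled here.

```python
import json

def check_expectation(output: str, config: dict) -> bool:
    """Illustrative sketch of an expectation_config check (not Seclai's
    actual evaluator). Covers expected_format and contains only."""
    # expected_format: for "json", the output must parse as JSON
    if config.get("expected_format") == "json":
        try:
            json.loads(output)
        except ValueError:
            return False

    # contains: every listed string or key must appear in the raw output
    for needle in config.get("contains", []):
        if needle not in output:
            return False

    return True
```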

Eval & Retry

Eval & Retry criteria evaluate step outputs and automatically retry the step if evaluation fails. This is useful for steps with transient failures, where re-running the step with the same inputs is likely to succeed.

Key settings:

  • Max Retries — Maximum number of retry attempts (1–10, default 3).
  • Retry on Failure — Whether to actually trigger a retry, or just record the failure.

When an evaluation fails and retries are enabled, Seclai will:

  1. Record the failed evaluation result
  2. Re-execute the step with the same inputs
  3. Re-evaluate the new output
  4. Repeat until the evaluation passes or max retries is reached

Each retry is tracked in the evaluation results with an incrementing retry_count.
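The loop above can be sketched as follows. This is a hypothetical helper, not the Seclai runtime; the result dicts only mirror the fields the docs describe (status, retry_triggered, retry_count).

```python
def run_with_retries(execute_step, evaluate, inputs, max_retries=3):
    """Sketch of the Eval & Retry loop: execute, evaluate, retry on
    failure until a pass or max_retries is reached. Returns the final
    output plus one result record per attempt."""
    results = []
    retry_count = 0
    while True:
        output = execute_step(inputs)  # re-execute with the same inputs
        passed = evaluate(output)
        results.append({
            "status": "passed" if passed else "failed",
            "retry_triggered": not passed and retry_count < max_retries,
            "retry_count": retry_count,
        })
        if passed or retry_count >= max_retries:
            return output, results
        retry_count += 1
```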

Sample & Flag

Sample & Flag criteria don't run on every execution. Instead, they evaluate a subset of runs based on a sample frequency and flag runs that fail evaluation for human review.

This is ideal for production monitoring — you get quality signals without adding latency to every run.

Flagged runs appear in the evaluation results with flagged: true and can be filtered in both the UI and API.

Configuring Evaluations

To configure evaluation criteria:

  1. Navigate to your agent's detail page
  2. Click the Evaluations tab
  3. Click Add Criteria
  4. Fill in the criteria details:
    • Type — Choose Output Expectation, Eval & Retry, or Sample & Flag
    • Step ID — Target the terminal step to evaluate
    • Type-specific settings — Configure the expectation config, retry settings, or sample frequency
    • Evaluation Prompt — (Optional) Custom prompt for LLM-based evaluation
  5. Click Create

You can enable or disable a criterion at any time using the toggle on its card, without deleting it.

Evaluation Results

Each time a criterion is evaluated, an evaluation result is created. Results include:

| Field | Description |
| --- | --- |
| status | passed, failed, pending, skipped, or error |
| score | Numeric score (0.0–1.0), if applicable |
| details | JSON object with evaluation-specific details |
| retry_triggered | Whether this result caused a retry |
| retry_count | Number of retries attempted so far |
| flagged | Whether the run was flagged for review |
| evaluated_at | Timestamp of the evaluation |

Results are shown in a unified table below the step cards and can be filtered by status, step, and timeframe.

Aggregated Results View

The results table is aggregated across criteria, so pagination and sorting are applied once across all matching rows. This makes it easier to triage failures by recency without jumping between criteria.

Filtering Results

The Evaluations tab provides multiple ways to filter and explore evaluation results.

Time Frame Selection

Filter results by time using the time frame selector:

  • Relative — Last N minutes, hours, days, weeks, or months
  • Absolute — Specific start and end dates

Sample Frequency

For Sample & Flag criteria, choose how often runs are evaluated:

| Frequency | Description |
| --- | --- |
| Every run | Evaluate every single run (highest visibility, highest overhead) |
| Every 5th run | Evaluate 1 in 5 runs |
| Every 10th run | Evaluate 1 in 10 runs |
| Every 25th run | Evaluate 1 in 25 runs |
| Every 50th run | Evaluate 1 in 50 runs |
| Every 100th run | Evaluate 1 in 100 runs (lowest overhead) |

Choose a frequency that balances visibility with performance for your use case.
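One plausible way to picture an every-Nth-run sampler is a deterministic modulus check. Seclai's actual sampling strategy is not documented here (it could, for example, be random), so treat this as a sketch of the trade-off, not the real mechanism.

```python
def should_sample(run_index: int, frequency: int) -> bool:
    """Hypothetical deterministic sampler: evaluate every Nth run.
    frequency=1 means every run; frequency=10 means 1 in 10 runs."""
    return run_index % frequency == 0
```

With frequency 5, exactly 20 of every 100 runs are evaluated, so evaluation overhead scales down linearly as the frequency interval grows.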

Evaluation Prompt

Use a custom Evaluation Prompt to instruct the LLM evaluator. This prompt is sent along with the step output to determine pass/fail:

Evaluate whether the output meets the following criteria:
1. Contains a clear summary paragraph
2. Includes at least 3 actionable recommendations
3. Uses professional tone throughout
4. Does not contain placeholders or TODO items

Return PASS if all criteria are met, FAIL otherwise.

API Access

Evaluation criteria and results can be managed via the API using your API key.

List Evaluation Criteria

AGENT_ID=...

curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
  -H "X-API-Key: $SECLAI_API_KEY"

Create Evaluation Criteria

curl -X POST "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Output format check",
    "step_id": "step_1",
    "evaluation_mode": "output_expectation",
    "expectation_config": {
      "expected_format": "json",
      "contains": ["result", "status"]
    }
  }'

Get Single Evaluation Criteria

CRITERIA_ID=...

curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY"

Update Evaluation Criteria

curl -X PATCH "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": false
  }'

Delete Evaluation Criteria

curl -X DELETE "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
  -H "X-API-Key: $SECLAI_API_KEY"

List Evaluation Results

curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results?status=failed&flagged_only=true&page=1&limit=20" \
  -H "X-API-Key: $SECLAI_API_KEY"

List Aggregated Agent Evaluation Results

curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-results?status=failed&step=step_1&page=1&limit=20" \
  -H "X-API-Key: $SECLAI_API_KEY"
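Since the aggregated endpoint is paginated (`page` and `limit` parameters), a client typically loops until it receives a short page. The sketch below takes an injected `fetch_page` callable so any HTTP client can be plugged in; the assumption that the response body decodes to a plain list of results is mine, so adapt the accessor to the actual payload envelope.

```python
def iter_evaluation_results(fetch_page, limit=20):
    """Page through a paginated results endpoint, e.g.
    GET /agents/{AGENT_ID}/evaluation-results?page=N&limit=M.

    fetch_page(page, limit) should return the decoded list of results
    for that page (an assumed response shape)."""
    page = 1
    while True:
        batch = fetch_page(page, limit)
        if not batch:
            return
        yield from batch
        if len(batch) < limit:  # short page means last page
            return
        page += 1
```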

Create Evaluation Result

Record an evaluation result against a criterion (e.g. from an external test harness or CI pipeline):

curl -X POST "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results" \
  -H "X-API-Key: $SECLAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_run_id": "RUN_UUID",
    "status": "passed",
    "score": 0.95,
    "details": { "reason": "Output matched expected JSON schema" },
    "retry_triggered": false,
    "retry_count": 0,
    "flagged": false
  }'
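From a CI job, the request body above is usually assembled programmatically. This helper is hypothetical (only the field names come from the curl example); flagging failures for review is one possible policy, not a requirement of the API.

```python
import json

def build_result_payload(run_id, passed, score=None, reason=None):
    """Assemble the JSON body for
    POST /agents/evaluation-criteria/{CRITERIA_ID}/results.
    Field names mirror the curl example; the helper itself is illustrative."""
    payload = {
        "agent_run_id": run_id,
        "status": "passed" if passed else "failed",
        "retry_triggered": False,
        "retry_count": 0,
        "flagged": not passed,  # flag failures for human review (a policy choice)
    }
    if score is not None:
        payload["score"] = score
    if reason:
        payload["details"] = {"reason": reason}
    return json.dumps(payload)
```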

Get Run Evaluation Results

RUN_ID=...

curl "https://api.seclai.com/agents/$AGENT_ID/runs/$RUN_ID/evaluation-results" \
  -H "X-API-Key: $SECLAI_API_KEY"

Get Evaluation Summary

curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/summary" \
  -H "X-API-Key: $SECLAI_API_KEY"

Summary Metrics

For account-level monitoring, use the summary endpoint:

curl "https://api.seclai.com/agents/evaluation-results/non-manual-summary?days=30" \
  -H "X-API-Key: $SECLAI_API_KEY"

This includes only eval_and_retry and sample_and_flag criteria. In these summaries:

  • failed includes both failed and error statuses.
  • flagged counts results explicitly marked for review.
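Given those conventions, a pass rate can be derived from the summary counts. The field names in this sketch are assumptions about the response shape; the one documented fact it relies on is that `failed` already folds in `error` statuses.

```python
def pass_rate(summary: dict) -> float:
    """Derive a pass rate from summary counts. Assumes the response
    exposes 'passed' and 'failed' totals; per the docs, 'failed'
    already includes error statuses in the non-manual summary."""
    passed = summary.get("passed", 0)
    failed = summary.get("failed", 0)
    total = passed + failed
    return passed / total if total else 0.0
```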

MCP Tools

Evaluation criteria can also be managed through the MCP (Model Context Protocol) server. Connect your MCP client to your agent's MCP endpoint and use these tools:

| Tool | Description |
| --- | --- |
| list_evaluation_criteria | List all evaluation criteria for the agent |
| create_evaluation_criteria | Create a new evaluation criterion |
| update_evaluation_criteria | Update an existing criterion |
| get_evaluation_criteria | Fetch a single criterion by ID with result summary |
| delete_evaluation_criteria | Delete a criterion and all its results |
| list_evaluation_results | List evaluation results for a criterion with filtering |
| create_evaluation_result | Record an evaluation result (status, score, details) |
| get_evaluation_summary | Get aggregated summary (pass/fail/error/flagged counts) |
| list_run_evaluation_results | List all evaluation results for a specific run |
| list_agent_evaluation_results | List aggregated evaluation results for an agent with filters |
| get_non_manual_evaluation_summary | Get account-level evaluation rollups |

Example MCP tool call:

{
  "tool": "create_evaluation_criteria",
  "arguments": {
    "agent_id": "AGENT_UUID",
    "step_id": "step_1",
    "description": "JSON output validation",
    "evaluation_mode": "output_expectation",
    "expectation_config": {
      "expected_format": "json"
    },
    "evaluation_prompt": "Verify the output is valid JSON with a 'result' key"
  }
}

Exporting

Export an agent's evaluation results as a downloadable file. Supported formats are JSON, JSONL, and CSV.

  • UI: Open the agent's Evaluations tab and click the Export button. You'll see an estimate of the record count before confirming.
  • API: POST /authenticated/resource-exports with resource_type: "agent_evals" and the agent ID.
  • MCP: Use the create_resource_export tool with resource_type: "agent_evals".

Exports include evaluation status, scores, details, retry information, and timestamps for each result.

See Export Formats → Agent Evaluations for the full file schema and available filter options.

Next Steps

  • Agents — Learn about creating and configuring agents
  • Agent Steps — Understand agent step definitions and execution
  • Alerts — Set up notifications for agent failures
  • API Introduction — Get started with the Seclai API