Agent Evaluations
Agent Evaluations let you define quality criteria for your agent's outputs, configure automatic retries when criteria aren't met, and sample production runs for ongoing quality monitoring. Evaluation criteria are configured per-agent from the Evaluations tab on the agent detail page.
Overview
The evaluation system has two components:
- Evaluation Criteria — Rules you define that specify what "good output" looks like for a step or the entire agent.
- Evaluation Results — Records created each time a criterion is evaluated against an agent run, including pass/fail status, scores, and whether retries were triggered.
You can define multiple criteria per agent. Each criterion targets a specific terminal step by step ID.
Criteria Types
Seclai supports three types of evaluation criteria, each serving a different purpose:
| Type | Purpose | Runs On |
|---|---|---|
| Output Expectation | Validate output format, schema, or content patterns | Every matching step output |
| Eval & Retry | Evaluate output and retry automatically on failure | Every matching step output |
| Sample & Flag | Periodically evaluate and flag runs for human review | Sampled runs (configurable) |
Output Expectations
Output Expectations validate that step outputs match expected formats, schemas, or content patterns. Use them to enforce structural requirements on your agent's outputs.
The Expectation Config is a JSON object that defines what to check:
{
"expected_format": "json",
"contains": ["summary", "recommendation"],
"custom_prompt": "The output must include a numbered list of at least 3 items."
}
| Field | Description |
|---|---|
| expected_format | The expected output format (json, text, markdown, etc.) |
| contains | Array of strings or keys that the output must contain |
| custom_prompt | A natural-language prompt describing additional requirements |
Output expectations run on every matching step execution and produce a pass/fail result.
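As a rough illustration of what an output expectation checks, the sketch below mirrors the expectation_config fields from the example above. This is not Seclai's actual implementation; the check_expectation helper is an assumption, and the custom_prompt field (which is handled by an LLM evaluator) is skipped here.

```python
import json

def check_expectation(output: str, config: dict) -> bool:
    """Hypothetical sketch of an output-expectation check.

    Mirrors the documented expectation_config fields:
    expected_format and contains. A real custom_prompt check
    would go to an LLM evaluator and is omitted here.
    """
    # expected_format: only "json" is structurally checkable in this sketch
    if config.get("expected_format") == "json":
        try:
            json.loads(output)
        except ValueError:
            return False
    # contains: every listed string or key must appear in the output
    for needle in config.get("contains", []):
        if needle not in output:
            return False
    return True

config = {"expected_format": "json", "contains": ["summary", "recommendation"]}
print(check_expectation('{"summary": "s", "recommendation": "r"}', config))  # True
print(check_expectation('{"summary": "s"}', config))                         # False
```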
Eval & Retry
Eval & Retry criteria evaluate step outputs and automatically retry the step if evaluation fails. This is useful for steps where occasional failures are acceptable if a retry succeeds.
Key settings:
- Max Retries — Maximum number of retry attempts (1–10, default 3).
- Retry on Failure — Whether to actually trigger a retry, or just record the failure.
When an evaluation fails and retries are enabled, Seclai will:
- Record the failed evaluation result
- Re-execute the step with the same inputs
- Re-evaluate the new output
- Repeat until the evaluation passes or max retries is reached
Each retry is tracked in the evaluation results with an incrementing retry_count.
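The retry flow above can be sketched as a simple loop. The run_step and evaluate callables stand in for Seclai's internal step execution and evaluator, and the result records here only approximate the real evaluation results.

```python
def eval_and_retry(run_step, evaluate, inputs, max_retries=3):
    """Sketch of the Eval & Retry flow (not Seclai's internal code).

    run_step(inputs) -> output; evaluate(output) -> bool.
    Returns (output, results), where results mimics evaluation
    records with status, retry_count, and retry_triggered fields.
    """
    results = []
    retry_count = 0
    while True:
        output = run_step(inputs)   # re-execute the step with the same inputs
        passed = evaluate(output)   # re-evaluate the new output
        results.append({
            "status": "passed" if passed else "failed",
            "retry_count": retry_count,
            "retry_triggered": (not passed) and retry_count < max_retries,
        })
        # repeat until the evaluation passes or max retries is reached
        if passed or retry_count >= max_retries:
            return output, results
        retry_count += 1            # each retry increments retry_count

# A step that fails twice, then succeeds on the second retry:
attempts = iter(["bad", "bad", "good"])
out, res = eval_and_retry(lambda _: next(attempts), lambda o: o == "good", {})
# out == "good"; res records three attempts, the last with retry_count == 2
```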
Sample & Flag
Sample & Flag criteria don't run on every execution. Instead, they evaluate a subset of runs based on a sample frequency and flag runs that fail evaluation for human review.
This is ideal for production monitoring — you get quality signals without adding latency to every run.
Flagged runs appear in the evaluation results with flagged: true and can be filtered in both the UI and API.
Configuring Evaluations
To configure evaluation criteria:
- Navigate to your agent's detail page
- Click the Evaluations tab
- Click Add Criteria
- Fill in the criteria details:
  - Type — Choose Output Expectation, Eval & Retry, or Sample & Flag
  - Step ID — Target the terminal step to evaluate
  - Type-specific settings — Configure the expectation config, retry settings, or sample frequency
  - Evaluation Prompt — (Optional) Custom prompt for LLM-based evaluation
- Click Create
You can enable/disable criteria at any time using the toggle on each criteria card without deleting them.
Evaluation Results
Each time a criterion is evaluated, an evaluation result is created. Results include:
| Field | Description |
|---|---|
| status | passed, failed, pending, skipped, or error |
| score | Numeric score (0.0–1.0), if applicable |
| details | JSON object with evaluation-specific details |
| retry_triggered | Whether this result caused a retry |
| retry_count | Number of retries attempted so far |
| flagged | Whether the run was flagged for review |
| evaluated_at | Timestamp of the evaluation |
Results are shown in a unified table below the step cards and can be filtered by status, step, and timeframe.
Aggregated Results View
The results table is aggregated across criteria, so pagination and sorting are applied once across all matching rows. This makes it easier to triage failures by recency without jumping between criteria.
Filtering Results
The Evaluations tab provides multiple ways to filter and explore evaluation results.
Time Frame Selection
Filter results by time using the time frame selector:
- Relative — Last N minutes, hours, days, weeks, or months
- Absolute — Specific start and end dates
Sample Frequency
For Sample & Flag criteria, choose how often runs are evaluated:
| Frequency | Description |
|---|---|
| Every run | Evaluate every single run (highest visibility, highest overhead) |
| Every 5th run | Evaluate 1 in 5 runs |
| Every 10th run | Evaluate 1 in 10 runs |
| Every 25th run | Evaluate 1 in 25 runs |
| Every 50th run | Evaluate 1 in 50 runs |
| Every 100th run | Evaluate 1 in 100 runs (lowest overhead) |
Choose a frequency that balances visibility with performance for your use case.
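One common way to implement "every Nth run" sampling is a modulo check on a run counter. This sketch is an assumption about the mechanics, not Seclai's internal code, but it shows why higher frequencies mean lower overhead.

```python
def should_sample(run_number: int, frequency: int) -> bool:
    """Evaluate every `frequency`-th run (frequency=1 means every run)."""
    return run_number % frequency == 0

# With "Every 10th run", 1 in 10 runs is evaluated:
sampled = [n for n in range(1, 31) if should_sample(n, 10)]
# sampled == [10, 20, 30]
```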
Evaluation Prompt
Use a custom Evaluation Prompt to instruct the LLM evaluator. This prompt is sent along with the step output to determine pass/fail:
Evaluate whether the output meets the following criteria:
1. Contains a clear summary paragraph
2. Includes at least 3 actionable recommendations
3. Uses professional tone throughout
4. Does not contain placeholders or TODO items
Return PASS if all criteria are met, FAIL otherwise.
API Access
Evaluation criteria and results can be managed via the API using your API key.
List Evaluation Criteria
AGENT_ID=...
curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
-H "X-API-Key: $SECLAI_API_KEY"
Create Evaluation Criteria
curl -X POST "https://api.seclai.com/agents/$AGENT_ID/evaluation-criteria" \
-H "X-API-Key: $SECLAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"description": "Output format check",
"step_id": "step_1",
"evaluation_mode": "output_expectation",
"expectation_config": {
"expected_format": "json",
"contains": ["result", "status"]
}
}'
Get Single Evaluation Criteria
CRITERIA_ID=...
curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
-H "X-API-Key: $SECLAI_API_KEY"
Update Evaluation Criteria
curl -X PATCH "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
-H "X-API-Key: $SECLAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"enabled": false
}'
Delete Evaluation Criteria
curl -X DELETE "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID" \
-H "X-API-Key: $SECLAI_API_KEY"
List Evaluation Results
curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results?status=failed&flagged_only=true&page=1&limit=20" \
-H "X-API-Key: $SECLAI_API_KEY"
List Aggregated Agent Evaluation Results
curl "https://api.seclai.com/agents/$AGENT_ID/evaluation-results?status=failed&step=step_1&page=1&limit=20" \
-H "X-API-Key: $SECLAI_API_KEY"
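When scripting against these endpoints, it can help to build query URLs programmatically. The helper below is illustrative: the endpoint path and query parameters come from the curl example above, while the helper itself is an assumption, not part of any Seclai SDK.

```python
from urllib.parse import urlencode

API_BASE = "https://api.seclai.com"

def agent_results_url(agent_id: str, **filters) -> str:
    """Build the aggregated evaluation-results URL with optional
    filters such as status, step, page, and limit (hypothetical helper)."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{API_BASE}/agents/{agent_id}/evaluation-results"
    return f"{url}?{query}" if query else url

url = agent_results_url("AGENT_UUID", status="failed", step="step_1", page=1, limit=20)
# url == "https://api.seclai.com/agents/AGENT_UUID/evaluation-results?status=failed&step=step_1&page=1&limit=20"
```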
Create Evaluation Result
Record an evaluation result against a criteria (e.g. from an external test harness or CI pipeline):
curl -X POST "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/results" \
-H "X-API-Key: $SECLAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"agent_run_id": "RUN_UUID",
"status": "passed",
"score": 0.95,
"details": { "reason": "Output matched expected JSON schema" },
"retry_triggered": false,
"retry_count": 0,
"flagged": false
}'
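In a CI pipeline you would typically run your own check locally and then assemble the request body before POSTing it. The sketch below builds that body; the payload fields match the request above, while the flagging policy in the comment is a choice, not a Seclai requirement.

```python
import json

def build_result_payload(run_id: str, passed: bool, score: float, reason: str) -> str:
    """Assemble the JSON body for POST .../evaluation-criteria/{id}/results
    (hypothetical helper; field names follow the documented request)."""
    return json.dumps({
        "agent_run_id": run_id,
        "status": "passed" if passed else "failed",
        "score": score,
        "details": {"reason": reason},
        "retry_triggered": False,
        "retry_count": 0,
        "flagged": not passed,  # flag failures for human review (a policy choice)
    })

body = build_result_payload("RUN_UUID", True, 0.95, "Output matched expected JSON schema")
```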
Get Run Evaluation Results
RUN_ID=...
curl "https://api.seclai.com/agents/$AGENT_ID/runs/$RUN_ID/evaluation-results" \
-H "X-API-Key: $SECLAI_API_KEY"
Get Evaluation Summary
curl "https://api.seclai.com/agents/evaluation-criteria/$CRITERIA_ID/summary" \
-H "X-API-Key: $SECLAI_API_KEY"
Summary Metrics
For account-level monitoring, use the summary endpoint:
curl "https://api.seclai.com/agents/evaluation-results/non-manual-summary?days=30" \
-H "X-API-Key: $SECLAI_API_KEY"
This includes only eval_and_retry and sample_and_flag criteria. In these summaries:
- failed includes both failed and error statuses.
- flagged counts results explicitly marked for review.
MCP Tools
Evaluation criteria can also be managed through the MCP (Model Context Protocol) server. Connect your MCP client to your agent's MCP endpoint and use these tools:
| Tool | Description |
|---|---|
| list_evaluation_criteria | List all evaluation criteria for the agent |
| create_evaluation_criteria | Create a new evaluation criteria |
| update_evaluation_criteria | Update an existing criteria |
| get_evaluation_criteria | Fetch a single criteria by ID with result summary |
| delete_evaluation_criteria | Delete a criteria and all its results |
| list_evaluation_results | List evaluation results for a criteria with filtering |
| create_evaluation_result | Record an evaluation result (status, score, details) |
| get_evaluation_summary | Get aggregated summary (pass/fail/error/flagged counts) |
| list_run_evaluation_results | List all evaluation results for a specific run |
| list_agent_evaluation_results | List aggregated evaluation results for an agent with filters |
| get_non_manual_evaluation_summary | Get account-level evaluation rollups |
Example MCP tool call:
{
"tool": "create_evaluation_criteria",
"arguments": {
"agent_id": "AGENT_UUID",
"step_id": "step_1",
"description": "JSON output validation",
"evaluation_mode": "output_expectation",
"expectation_config": {
"expected_format": "json"
},
"evaluation_prompt": "Verify the output is valid JSON with a 'result' key"
}
}
Exporting
Export an agent's evaluation results as a downloadable file. Supported formats are JSON, JSONL, and CSV.
- UI: Open the agent's Evaluations tab and click the Export button. You'll see an estimate of the record count before confirming.
- API: POST /authenticated/resource-exports with resource_type: "agent_evals" and the agent ID.
- MCP: Use the create_resource_export tool with resource_type: "agent_evals".
Exports include evaluation status, scores, details, retry information, and timestamps for each result.
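A minimal sketch of the export request body follows. The resource_type value comes from the docs; the agent_id and format field names are assumptions for illustration, since the exact body schema isn't spelled out here (see the linked Export Formats page).

```python
import json

# Hypothetical body for POST /authenticated/resource-exports.
# resource_type is documented; "agent_id" and "format" are assumed names.
body = json.dumps({
    "resource_type": "agent_evals",
    "agent_id": "AGENT_UUID",
    "format": "jsonl",  # supported export formats: json, jsonl, csv
})
```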
See Export Formats → Agent Evaluations for the full file schema and available filter options.
Next Steps
- Agents — Learn about creating and configuring agents
- Agent Steps — Understand agent step definitions and execution
- Alerts — Set up notifications for agent failures
- API Introduction — Get started with the Seclai API