Agent Streaming
The streaming_result step type enables real-time token-by-token delivery of LLM output to clients via Server-Sent Events (SSE). This is ideal for chat-like interfaces where users should see the response being generated incrementally.
Requirements:
- The agent must use a dynamic_input trigger
- Runs must use priority: true (priority mode; see the request sketch below)
- Only one streaming_result step is allowed per agent definition
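For example, starting a priority run against the streaming endpoint might look like the sketch below. This is a minimal illustration, and it assumes the priority flag is passed in the request body next to the input; check the runs API reference for the exact field name in your deployment.

// Hypothetical request sketch: passing "priority" as a body field
// is an assumption, not a confirmed API shape.
await fetch(`/api/agents/${agentId}/runs/stream`, {
  method: "POST",
  headers: { "Content-Type": "application/json", "X-API-Key": apiKey },
  body: JSON.stringify({ input: "Hello", priority: true }),
});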
Setting Up Streaming
Workflow Structure
Place a streaming_result step as a direct child of a prompt_call step:
Step 1: prompt_call (model with streaming support)
└── Step 2: streaming_result
The streaming_result step must be a direct child of prompt_call — no intermediate steps are allowed between them.
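In an agent definition this might be expressed roughly as follows. The sketch is hypothetical: the surrounding schema (the steps array and the id, type, and children fields) is an assumption chosen for illustration, not the exact definition format.

{
  "steps": [
    {
      "id": "prompt_step_1",
      "type": "prompt_call",
      "children": [
        { "id": "streaming_step_1", "type": "streaming_result" }
      ]
    }
  ]
}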
Prompt Call Requirements
The parent prompt_call step must use a streaming-capable model — the selected model must have the supports_streaming capability. Most major models from Anthropic, OpenAI, and Google support streaming. Check the model's capability badges in the model selector. Both simple format (plain text) and advanced/JSON template prompt calls support streaming.
If the model does not support streaming, the full response will be delivered as a single chunk when complete — the agent run still succeeds, but without real-time token delivery.
Side-Effect Steps
To perform side effects (save to memory, send email, publish content) alongside streaming, add those steps as sibling branches under the same prompt_call:
Step 1: prompt_call
├── Branch 1: streaming_result (real-time delivery)
└── Branch 2: add_memory (save to memory bank)
Sibling branches receive the full buffered output after streaming completes, then execute in parallel with each other under normal scheduling.
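Continuing the hypothetical definition sketch from above, sibling branches would sit side by side in the prompt_call step's children (field names are again illustrative assumptions):

{
  "id": "prompt_step_1",
  "type": "prompt_call",
  "children": [
    { "id": "streaming_step_1", "type": "streaming_result" },
    { "id": "memory_step_1", "type": "add_memory" }
  ]
}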
Consuming Streaming Events
SSE Event Types
When using the POST /agents/{agent_id}/runs/stream endpoint, the following streaming-specific events are emitted:
stream_token — An individual token from the LLM response:
event: stream_token
data: {"step_id": "streaming_step_1", "token": "Hello", "seq": 1}
stream_end — Streaming is complete for a step:
event: stream_end
data: {"step_id": "streaming_step_1"}
These events are interleaved with standard run events (step_completed, done, etc.).
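Concretely, the wire format for a short run might look like the following. The stream_token and stream_end frames match the shapes documented above; the step_completed and done payloads are abbreviated illustrations, not exact schemas.

event: stream_token
data: {"step_id": "streaming_step_1", "token": "Hello", "seq": 1}

event: stream_token
data: {"step_id": "streaming_step_1", "token": " there", "seq": 2}

event: stream_end
data: {"step_id": "streaming_step_1"}

event: step_completed
data: {"step_id": "streaming_step_1"}

event: done
data: {"run_id": "...", "status": "completed"}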
JavaScript Client Example
const response = await fetch(`/api/agents/${agentId}/runs/stream`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": apiKey,
  },
  body: JSON.stringify({ input: "Tell me about climate change" }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let streamedText = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE messages are separated by a blank line ("\n\n").
  let boundary;
  while ((boundary = buffer.indexOf("\n\n")) !== -1) {
    const message = buffer.slice(0, boundary);
    buffer = buffer.slice(boundary + 2);

    const eventMatch = message.match(/^event: (.+)$/m);
    const dataMatch = message.match(/^data: (.+)$/m);
    if (!eventMatch || !dataMatch) continue;

    const event = eventMatch[1];
    const data = JSON.parse(dataMatch[1]);

    if (event === "stream_token") {
      streamedText += data.token;
      updateUI(streamedText); // Update your UI incrementally
    } else if (event === "stream_end") {
      finalizeUI(); // Streaming complete for this step
    } else if (event === "done") {
      // Run completed; data contains the full run response.
      // This break exits the parse loop; the outer read loop
      // ends when the server closes the stream.
      break;
    }
  }
}
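A production client would also check response.ok before reading the body, guard the JSON.parse call against malformed frames, and cancel the reader if the user navigates away.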
Time to First Token
Streaming runs record a server-side time-to-first-token (TTFT) — the wall-clock elapsed time from when the run started to when the first SSE token was emitted by the server. TTFT captures the full responsiveness picture up to that point: input scan, governance gate, step dispatch, prompt assembly, model warm-up, and the first byte of generated output. Network buffering and round-trip time between the server and the client are not included; the Run Agent modal records a separate client-perceived TTFT for that view.
Where to see TTFT:
- Run Agent modal displays a TTFT badge next to the streaming output as soon as the first token arrives. This is the client-perceived value (includes network RTT to the SSE server).
- Trace details page surfaces the server-measured value in the run header (Time to first token: …) and on each streaming prompt_call step.
- GET /authenticated/agents/runs/{run_id} (polling) returns first_token_ms once the run records its first token. Useful for clients that want the authoritative server value.
- GET /authenticated/agents/runs/{run_id}/details (trace) returns both the run-level first_token_ms and a per-step first_token_ms on each streaming step.
- MCP get_agent_run_status returns first_token_ms (run-level) and per-step first_token_ms when include_step_outputs=true.
- Resource export (agent_traces) includes first_token_at on each run plus first_token_at on each attempt for offline analytics.
first_token_ms is null for non-streaming runs and for runs that errored before any token was emitted. The run-level value is set once, by the first streaming step that produces a token; multi-streaming-step runs (rare) capture only the earliest occurrence.
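To record your own client-perceived TTFT (the same flavor of measurement the Run Agent modal shows), you can timestamp the first stream_token in the SSE consumer above. A minimal sketch:

// Client-perceived TTFT: elapsed time from sending the request to
// the first stream_token event. Includes network RTT and buffering,
// so expect it to read higher than the server's first_token_ms.
const startedAt = performance.now();
let clientTtftMs = null;

// Call this from the `event === "stream_token"` branch of the parser.
function recordFirstToken() {
  if (clientTtftMs === null) {
    clientTtftMs = performance.now() - startedAt;
    console.log(`client TTFT: ${clientTtftMs.toFixed(0)} ms`);
  }
}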
Fallback Behavior
Streaming falls back to non-streaming (single-chunk delivery) in these cases:
- Non-priority mode — Streaming requires priority: true; without it the full response is delivered as one chunk
- Model doesn't support streaming — The full response is delivered as one chunk
- Governance output policies — When blocking governance policies are active on the step, streaming is disabled to allow content screening before delivery
- Tool-calling prompts — The tool loop completes fully first, then the final LLM response is streamed
- MCP server — Streaming is not supported via MCP; falls back to normal execution
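Because of these fallbacks, clients consuming the stream endpoint should not assume token-by-token delivery. The sketch below is defensive and rests on an assumption: that on fallback the full text arrives either as one large stream_token or on the final done payload (the data.output field name is hypothetical). Verify the actual fallback frames against your deployment.

// Handles both incremental streaming and single-chunk fallback.
if (event === "stream_token") {
  streamedText += data.token; // one token, or the whole response on fallback
  updateUI(streamedText);
} else if (event === "done") {
  // Assumption: data.output carries the full response when nothing
  // was streamed (hypothetical field name).
  finalizeUI(streamedText || data.output);
}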