Agent Streaming

The streaming_result step type enables real-time token-by-token delivery of LLM output to clients via Server-Sent Events (SSE). This is ideal for chat-like interfaces where users should see the response being generated incrementally.

Requirements:

  • The agent must use a dynamic_input trigger
  • Runs must use priority: true (priority mode; see the example request below)
  • Only one streaming_result step is allowed per agent definition
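
For example, starting a priority-mode run against the streaming endpoint might look like this (the input field and headers match the client example later in this section; whether priority must be passed explicitly in the body or is implied by the stream endpoint is an assumption here):

POST /agents/{agent_id}/runs/stream
Content-Type: application/json
X-API-Key: <your key>

{"input": "Tell me about climate change", "priority": true}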

Setting Up Streaming

Workflow Structure

Place a streaming_result step as a direct child of a prompt_call step:

Step 1: prompt_call (model with streaming support)
  └── Step 2: streaming_result

The streaming_result step must be a direct child of prompt_call — no intermediate steps are allowed between them.

Prompt Call Requirements

The parent prompt_call step must use a streaming-capable model — the selected model must have the supports_streaming capability. Most major models from Anthropic, OpenAI, and Google support streaming. Check the model's capability badges in the model selector. Both simple format (plain text) and advanced/JSON template prompt calls support streaming.

If the model does not support streaming, the full response will be delivered as a single chunk when complete — the agent run still succeeds, but without real-time token delivery.

Side-Effect Steps

To perform side effects (save to memory, send email, publish content) alongside streaming, add those steps as sibling branches under the same prompt_call:

Step 1: prompt_call
  ├── Branch 1: streaming_result (real-time delivery)
  └── Branch 2: add_memory (save to memory bank)

Sibling branches receive the full buffered output after streaming completes, then execute in parallel under normal scheduling.

[Diagram: Trigger (dynamic input) → Retrieval (search KB) → Prompt Call (streaming model) → Streaming Result (SSE → client), Add Memory (save reply), Send Email (notify user)]
Figure 1. Streaming with side-effects — the prompt call streams tokens to the client in real-time while memory and email steps run in parallel after the full response is buffered.

Consuming Streaming Events

SSE Event Types

When using the POST /agents/{agent_id}/runs/stream endpoint, the following streaming-specific events are emitted:

stream_token — An individual token from the LLM response:

event: stream_token
data: {"step_id": "streaming_step_1", "token": "Hello", "seq": 1}

stream_end — Streaming is complete for a step:

event: stream_end
data: {"step_id": "streaming_step_1"}

These events are interleaved with standard run events (step_completed, done, etc.).
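
A complete streaming run might therefore emit a sequence like the following (payloads for the standard events abbreviated, and the exact position of step_completed relative to stream_end is illustrative):

event: stream_token
data: {"step_id": "streaming_step_1", "token": "Hello", "seq": 1}

event: stream_token
data: {"step_id": "streaming_step_1", "token": " there", "seq": 2}

event: stream_end
data: {"step_id": "streaming_step_1"}

event: step_completed
data: {"step_id": "streaming_step_1", ...}

event: done
data: {...}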

JavaScript Client Example

const response = await fetch(`/api/agents/${agentId}/runs/stream`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": apiKey,
  },
  body: JSON.stringify({ input: "Tell me about climate change" }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let streamedText = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });

  let boundary;
  while ((boundary = buffer.indexOf("\n\n")) !== -1) {
    const message = buffer.slice(0, boundary);
    buffer = buffer.slice(boundary + 2);

    const eventMatch = message.match(/^event: (.+)$/m);
    const dataMatch = message.match(/^data: (.+)$/m);
    if (!eventMatch || !dataMatch) continue;

    const event = eventMatch[1];
    const data = JSON.parse(dataMatch[1]);

    if (event === "stream_token") {
      streamedText += data.token;
      updateUI(streamedText); // Update your UI incrementally
    } else if (event === "stream_end") {
      finalizeUI(); // Streaming complete
    } else if (event === "done") {
      // Run completed — data contains the full run response
      // (the outer read loop also ends when the server closes the stream)
      break;
    }
  }
}
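
The blank-line check in the inner loop works because SSE messages are delimited by a double newline. Buffering across reads, together with decoder.decode(value, { stream: true }), ensures that tokens split across network chunks are reassembled before parsing.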

Time to First Token

Streaming runs record a server-side time-to-first-token (TTFT) — the wall-clock elapsed time from when the run started to when the first SSE token was emitted by the server. TTFT captures the full responsiveness picture up to that point: input scan, governance gate, step dispatch, prompt assembly, model warm-up, and the first byte of generated output. Network buffering and round-trip time between the server and the client are not included; the Run Agent modal records a separate client-perceived TTFT for that view.

Where to see TTFT:

  • Run Agent modal displays a TTFT badge next to the streaming output as soon as the first token arrives. This is the client-perceived value (includes network RTT to the SSE server).
  • Trace details page surfaces the server-measured value in the run header (Time to first token: …) and on each streaming prompt_call step.
  • GET /authenticated/agents/runs/{run_id} (polling) returns first_token_ms once the run records its first token. Useful for clients that want the authoritative server value.
  • GET /authenticated/agents/runs/{run_id}/details (trace) returns both the run-level first_token_ms and a per-step first_token_ms on each streaming step.
  • MCP get_agent_run_status returns first_token_ms (run-level) and per-step first_token_ms when include_step_outputs=true.
  • Resource export (agent_traces) includes first_token_at on each run plus first_token_at on each attempt for offline analytics.

first_token_ms is null for non-streaming runs and for runs that errored before any token was emitted. The run-level value is set once, by the first streaming step that produces a token; multi-streaming-step runs (rare) capture only the earliest occurrence.
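
To record your own client-perceived TTFT (comparable to the Run Agent modal's badge), timestamp the request and the first stream_token event. A minimal sketch extending the JavaScript client example above:

const t0 = performance.now();
let clientTtftMs = null;

// ...inside the event-parsing loop from the client example:
if (event === "stream_token" && clientTtftMs === null) {
  // Unlike the server-side first_token_ms, this includes network RTT to the SSE server
  clientTtftMs = performance.now() - t0;
}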


Fallback Behavior

Streaming falls back to non-streaming (single-chunk delivery) in these cases:

  • Non-priority mode — Streaming requires priority: true; without it the full response is delivered as one chunk
  • Model doesn't support streaming — The full response is delivered as one chunk
  • Governance output policies — When blocking governance policies are active on the step, streaming is disabled to allow content screening before delivery
  • Tool-calling prompts — The tool loop completes fully first, then the final LLM response is streamed
  • MCP server — Streaming is not supported via MCP; falls back to normal execution
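
Because of these fallbacks, clients should not assume stream_token events will always arrive. A minimal sketch for handling single-chunk delivery, assuming the done payload exposes the final text under a field such as data.output (the field name is hypothetical; check your run response shape):

let sawToken = false;

// ...inside the event-parsing loop from the client example:
if (event === "stream_token") {
  sawToken = true;
  streamedText += data.token;
  updateUI(streamedText);
} else if (event === "done") {
  if (!sawToken && data.output) {
    updateUI(data.output); // single-chunk fallback: render the full buffered response
  }
  finalizeUI();
}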