Talking to your agent by voice

Alfe agents can talk. Voice lets a person speak to an agent and hear it answer out loud, in a natural back-and-forth conversation. Under the hood, the platform transcribes what the caller says, hands the text to the agent, and speaks the agent’s reply back, so the agent works with text as usual while the person on the other end just talks and listens.

Voice is a mode, not a separate agent

Voice doesn’t create a second, “voice-only” agent. It’s a mode layered on top of the channels an agent already has — the same agent, the same memory, the same integrations, now reachable by speech. The most common way to reach an agent by voice is to give it a phone number and call it, but the capability is the same wherever voice is available: your words are converted to text for the agent, and its answers are converted back to speech for you.

Because the agent always sees text, everything else about it carries over unchanged. It can search its memory, call integrations, and act on what it hears, then respond in the same conversation.
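The speech-to-text-to-speech loop described above can be pictured as three stages wrapped around an ordinary text agent. The sketch below is only an illustration of that idea, not Alfe's implementation: the names voice_turn, fake_transcribe, fake_respond, and fake_speak are hypothetical, and the toy stages simply pass bytes through so the example runs without any network access.

```python
from typing import Callable

# Hypothetical stand-ins for the three stages; none of these names come from Alfe.
Transcribe = Callable[[bytes], str]   # caller audio -> text
Respond = Callable[[str], str]        # text in -> agent's text reply
Speak = Callable[[str], bytes]        # text reply -> audio


def voice_turn(audio_in: bytes, transcribe: Transcribe,
               respond: Respond, speak: Speak) -> bytes:
    """Run one caller turn: speech -> text -> agent -> text -> speech."""
    heard = transcribe(audio_in)   # the agent never sees audio, only this text
    reply = respond(heard)         # same agent logic as on any text channel
    return speak(reply)            # the reply is converted back to audio


# Toy stages so the sketch runs offline (they treat the "audio" as UTF-8 text).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_speak(text: str) -> bytes:
    return text.encode("utf-8")


audio_out = voice_turn(b"hello", fake_transcribe, fake_respond, fake_speak)
```

Because the agent step only ever receives and returns text, swapping the toy stages for real transcription and synthesis would leave the agent logic untouched, which is the point of treating voice as a mode.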

What a voice conversation feels like

Real-time. The agent starts speaking as soon as the beginning of its reply is ready, rather than waiting for the whole answer to be written first, so conversations feel responsive.
Two-way. People can interrupt, and the agent will stop and listen: a normal, interruptible conversation rather than a rigid prompt-and-wait exchange.
Inbound and outbound. An agent can answer incoming calls and, where a channel supports it, place outgoing ones.

On-demand speech for developers

Beyond live conversations, Alfe exposes two simple endpoints so an agent — or your own code acting as an agent — can convert between text and speech directly. Both are authenticated with the agent’s API key as a bearer token, the same way as the rest of the Agent API:

Authorization: Bearer <agent-api-key>
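As a small illustration, the header above could be built in Python like this. The helper name auth_headers is hypothetical, and where the key is stored (an environment variable, a secrets manager) is left to you.

```python
def auth_headers(agent_api_key: str) -> dict[str, str]:
    """Build the bearer-token header shown above (helper name is illustrative)."""
    if not agent_api_key:
        # Fail early rather than sending an unauthenticated request.
        raise ValueError("agent API key is empty")
    return {"Authorization": f"Bearer {agent_api_key}"}


headers = auth_headers("example-key")
```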

Text to speech

POST /voice/tts turns text into spoken audio.

Send a JSON body with the text to speak (up to a few thousand characters per request).
Receive an audio buffer of the spoken result. The response describes its format in headers (sample rate, channels, and bit depth) so you can play or save it.
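A minimal sketch of a text-to-speech call using only the Python standard library follows. Several details are assumptions, not documented behavior: the base URL is a placeholder, the JSON field name "text" is a guess, and the audio is assumed to be raw PCM (suggested by the sample rate, channel, and bit depth headers, whose exact names are not given here). The helper names build_tts_request and pcm_to_wav are hypothetical.

```python
import io
import json
import urllib.request
import wave

BASE_URL = "https://api.example.invalid"  # placeholder; use your real Agent API base URL


def build_tts_request(text: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST /voice/tts request."""
    body = json.dumps({"text": text}).encode("utf-8")  # "text" field name is an assumption
    return urllib.request.Request(
        BASE_URL + "/voice/tts",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int, bit_depth: int) -> bytes:
    """Wrap raw PCM samples in a WAV container so ordinary players can open them.

    Assumes the response body is uncompressed PCM; pass the values read from
    the response headers.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bit_depth // 8)  # bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


request = build_tts_request("Hello from your agent.", "example-key")
```

To send it, pass the request to urllib.request.urlopen, read the body, take the sample rate, channel count, and bit depth from the response headers, and hand all four to pcm_to_wav before writing the result to a .wav file.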

Speech to text

POST /voice/stt transcribes spoken audio into text.

Send the audio as the request body, telling the endpoint the audio’s sample rate.
Receive a JSON response with the transcribed text and a confidence score.
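The speech-to-text call could be sketched the same way. Again, treat the specifics as assumptions: the base URL is a placeholder, passing the sample rate as a sample_rate query parameter is a guess (it may instead be a header), the content type is assumed, and the response keys "text" and "confidence" are inferred from the description above. The names build_stt_request, parse_stt_response, and Transcript are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass

BASE_URL = "https://api.example.invalid"  # placeholder; use your real Agent API base URL


@dataclass
class Transcript:
    text: str
    confidence: float


def build_stt_request(audio: bytes, sample_rate: int, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST /voice/stt request with audio as the body."""
    query = urllib.parse.urlencode({"sample_rate": sample_rate})  # assumed parameter name
    return urllib.request.Request(
        f"{BASE_URL}/voice/stt?{query}",
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/octet-stream",  # assumed content type
        },
    )


def parse_stt_response(payload: bytes) -> Transcript:
    """Extract the transcript and confidence score (key names are assumptions)."""
    doc = json.loads(payload)
    return Transcript(text=doc["text"], confidence=float(doc["confidence"]))


stt_request = build_stt_request(b"\x00\x01", 16000, "example-key")
sample = parse_stt_response(b'{"text": "hello", "confidence": 0.9}')
```

A caller might use the confidence score to decide whether to act on the transcript or ask the speaker to repeat; the right threshold depends on your use case.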

Both endpoints draw on the tenant credit pool; see How voice is billed.

Where to go next

Phone & SMS — give an agent a number so people can call or text it.
How voice is billed — how voice usage draws on your credit pool.