Stepping into the sound studio with voice agents
Directing a voice when you can't be in the room
Everything stops. The deep, vault door seals shut and encases you in padded, sound-absorbing walls. There’s something about the moment you step off the screaming streets of Soho and into a sound studio. It’s more than the noise quietening down – it’s a disorientating nothingness that takes a moment to adjust to. Even the noise of putting your bag down and sitting on the sofa seems strange.
But you make yourself comfortable nonetheless. Today’s the day for a performance that’s been many months and script re-writes in the making. By the end of the session, the perfect read for a 30-second radio advert will have been crafted.
The ‘creatives’ (the scriptwriters) and the producer huddle on a sofa. The sound engineer commandeers everything technical from his vast flight deck. Then you have ‘the talent’ — the voice actor sitting in the glass-walled booth, selected because of their vocal character and experience to embody the words you’d until now only heard in your head.
Together you need to craft a natural, conversational delivery in a space that’s anything but.
The actor starts with a few warm-up reads.
The engineer zones in, adjusting knobs, dials, finding a shorthand with the actor.
Timing and pace start to take shape.
Everyone scribbles notes.
Tone gets analysed.
Stresses are debated.
Take 5 — nice lines 4–6, except the last word.
Take 12 — good opening line.
Take 22 — …
Tension starts rising. More water gets requested.
The mood oscillates throughout the session, but it’s productive to keep it upbeat.
There are different tricks. The read that’s “just a sound test” that often ends up relaxing the artist into the perfect performance. The improv read, i.e. letting the actor adjust things so they feel natural to them even if they’re a touch off-script. It’s always a balance of giving specific direction versus creating the environment for the right performance to just happen.
Plus, with the sound engineer you can rework the minutiae of every beat. Lines get carved up and restitched. Sometimes even an individual word gets sliced between syllables. Breaths are removed, pauses are perfected, pace is spread out, until finally the words are perfect and the timing lands on 30 seconds exactly. The listener would have no idea the amount of artistry that’s gone on behind the scenes.
The whole process is a fascinating mix of talent, psychology, and engineering.
In the era of voice agents, where an AI voice model can mimic human speech as an interface, there’s much we can draw from traditional voiceover artistry – how to use character, what sounds natural, and the importance of creating the right environment for a perfect performance.
Except for a single screaming difference – control.
Far from being able to review and refine every syllable, a voice agent will have thousands of conversations you’ll never hear and perform in situations you didn’t script for. This requires a lot more letting go than most designers are used to.
In fact, voice as an interface can be a steep learning curve for those who are getting these briefs for the first time. Product designers know how to shape visual interfaces. Brand designers and copywriters regularly work with ‘tone of voice’ guidelines. But bringing that brand to life through a voice agent – with pace, breath, and emotion, and that’s before we get to agency, accuracy, and compliance – is a whole new dimension.
Why it’s so important to get voice right
Using voice agents can be a fantastic way to provide a rich, empathetic and nuanced interaction. For businesses, voice agents can be used to augment and scale operations, supporting always-available, proactive customer service experiences. Voice agents can guide users through complex flows and provide clarification, adapting in real time to what each user actually needs. Voice has the potential to be a particularly special customer interface.
It’s the most human interface we’ve ever designed
While computer-to-human voice interfaces are young, our ‘human-to-human voice interface’ is embedded deeply into us – alive in the womb, and attuned in our body. Foetuses have been shown to have a preference for recordings of their mother’s voice over others. At just one day old, babies can distinguish speech from other sounds (with their left hemisphere lighting up for speech and the right for all other sounds). Plus our right ear (which is processed by the left, speech-focused hemisphere) has developed tiny hairs that exist purely to funnel voices over all other sounds.
So as humans, our bodies know that voices are special. But the implication for voice interfaces is fascinating. Research has shown that even when we hear computer-generated voices, the same neural and social machinery we use to process human speech kicks in automatically. Before we’ve consciously evaluated what we’re hearing, and even if we tell ourselves it’s ‘just a computer’, our brains can’t help but process the voice as human – as a social presence with intent and character.
This makes voice the most human computer interface we’ve ever designed for.
With it come high levels of human judgement
This also makes it very risky, though. Even when we know it’s not human, we can’t help but judge it in the same way, meaning the complex human relational dynamics come along with it. We demand politeness, social awareness and empathy, and the first impressions we form are fast, instinctive and surprisingly hard to revise. In the same way you’d judge someone’s whole character if they said something inappropriate about a sensitive subject, one odd inflection from a voice agent can cause a lot of brand damage, much more than a broken visual interaction.
But society has been quietly readying for it
Voice agents are arriving into a culture that’s been unconsciously preparing for them for a while.
Many households have been using voice commands with Alexa to ‘add milk to the shopping list’ or ‘play Stormzy’ for a decade now, so we’ve had time to feel the convenience of voice as an input.
Plus, AI-enabled voice-to-text dictation tools like Wispr Flow have become highly accurate and are seeing huge growth. People find dictating easier than typing: they can focus on their thoughts, keep their heads up and avoid hunching over a keyboard. Peter Steinberger, the entrepreneur who singlehandedly built OpenClaw, said in a recent interview that he barely types anymore – a sign that people’s relationships with the computer are changing.
Our phones have had so much friction removed from the interactions: Forms are pre-filled for us, security is completed through a simple glance, emails are pre-written, and entertainment… just a thumb gesture away. By comparison, typing can feel like a digital quill pen—laborious and primitive.
Socially, habits have been changing too: voice notes have gone mainstream. While they’re not loved by all, there’s a whole generation that has grown up with cameraphones and so has no shyness around a microphone. They seem much more ready to use voice notes as a default mode of communication, seeing them as a more intimate, efficient, and expressive alternative to texting.
Voice isn’t the future of every interface. But for the right use cases, the cultural groundwork has definitely been laid.
So how can you craft a successful voice agent ‘performance’?
Creating the character
Be clear on who it is and its role
Sounds obvious, but is the voice the brand itself or is the user speaking to an agent of the brand? The first is rarer and tends to be when the voice agent is the product itself—as with general AI platforms like ChatGPT. Here users are often given much more choice in shaping the persona and voice. But when it comes to existing brands, you’d find it odd to suddenly start talking to ‘Nike’ as a person, but you might quite naturally talk to someone Nike had created to help you find the right shoe. Most brands moving into voice are creating characters—entities that are separate from the brand. It feels more natural and protects the brand somewhat. A character can evolve, be refined, even retired—the brand stays intact.
Review your brand guidelines, they’re likely not up to the job
Until now, your brand guidelines will have been focused on guiding visuals and written copy, perhaps alongside some detail about mission and values. But now we need to interrogate them with voice in mind. Look at your brand through the lens of casting a character. How does it sound? How would it say things? How would the voice behave? How does this adjust depending on the voice’s role?
If you can’t answer what the voice does when a user is frustrated, or what it never says under any circumstances, the guidelines aren’t ready for voice yet.
Your internal brand and employee onboarding docs might actually be more help
A voice agent is more like a member of staff than an advertising voiceover. Because of this, onboarding and employee guidelines can be a useful reference. “What do we tell a new customer service hire?”, “What does a good conversation look like?”, “How should they never act?”, “What behaviour supports customer retention?”, “What regulations do they need to abide by?” These are all things that need to be considered.
Writing the script
Step 1: Bin the idea of a script
It’s impossible. There are too many variations. And even if you tried, your agent would sound too robotic.
We tried a bit of this when we were first creating Finley, a voice-enabled AI financial advisor. Knowing that financial advice conversations had to comply with certain rules, we gave him what we wanted him to say at certain points. And it went exactly how you’d imagine it to sound if someone brought a pre-written script to a conversation. Like a self-centred idiot who ignores what you just said and goes straight back to what they want to discuss with zero awareness. Urgh, I’d rather have poor financial management than tolerate this so-called ‘conversation’.
What worked for us wasn’t a script, it was a system.
A system of agents with goals
The system is a multi-dimensional map of agents, each with its own goals and tools. They carry parts of the conversation and hand off to each other behind the scenes, all upholding consistent standards and personality, so to the user it seems like a single agent having a fluid conversation across various aspects of their financial life.
It’s just like in the sound booth: when we were too specific, the voiceover artist could often end up being wooden. But talking to them about the end goal – the feeling we wanted to create in the listener – and allowing them to interpret it in their own way often produced the perfect result. Giving the agents your goals, and allowing them to put those goals into their own words in whatever way suits the moment, will get a much more human result.
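To make the idea of goal-carrying agents a little more concrete, here is a minimal sketch in Python. It is not how Finley was built: the persona wording, agent names, goals, tool names and the keyword-based handoff are all assumptions made up for illustration. A real build would more likely let the model itself decide when to hand off.

```python
import re
from dataclasses import dataclass

# One persona shared by every agent, so the user hears a single character.
# (Illustrative wording only, not a real production prompt.)
SHARED_PERSONA = "Warm, plain-spoken, never pushy. Always respond to what the user just said."


@dataclass
class GoalAgent:
    name: str
    goal: str             # the outcome to aim for, not lines to recite
    topics: frozenset     # keywords this agent is responsible for
    tools: tuple = ()     # hypothetical tool names this agent may call

    def system_prompt(self) -> str:
        """Combine the shared persona with this agent's own goal and tools."""
        tool_list = ", ".join(self.tools) or "none"
        return f"Persona: {SHARED_PERSONA}\nGoal: {self.goal}\nTools: {tool_list}"


def pick_agent(utterance, agents, current):
    """Hand off to the first agent whose topics appear in the utterance;
    otherwise stay with the current agent. Keyword matching is a stand-in
    for whatever routing logic a real system would use."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for agent in agents:
        if agent.topics & words:
            return agent
    return current


budgeting = GoalAgent(
    name="budgeting",
    goal="Help the user feel in control of their monthly spending.",
    topics=frozenset({"budget", "spending", "bills"}),
    tools=("get_transactions",),
)
pensions = GoalAgent(
    name="pensions",
    goal="Help the user understand whether they are saving enough for retirement.",
    topics=frozenset({"pension", "retirement"}),
    tools=("get_pension_balance",),
)
AGENTS = [budgeting, pensions]

# The user changes subject mid-conversation; the handoff happens behind the scenes.
active = pick_agent("What about my pension?", AGENTS, current=budgeting)
print(active.name)
print(active.system_prompt())
```

The point of the sketch is the shape rather than the routing: each agent gets a goal and a toolset instead of lines to say, and the shared persona is what keeps the handoff invisible to the user.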
Armed with context
What does the agent know, in any specific moment, about the person it’s talking to?
Context injection is the practice of populating an AI agent’s prompt with dynamic, situationally relevant information before the model reasons. It’s the difference between asking someone “how can I help you?” and asking “I noticed you’ve been stuck on this for twenty minutes and the deadline is tomorrow, what do you need?”
The second question demonstrates care and attention, and it speeds up the conversation. In a human conversation, this comes from attention, relationship, shared history. In an LLM-based product, it comes from whatever a designer chose to inject.
Context can be almost anything: the user’s current task, their role, their past interactions, their preferences, the time of day, whether they’re a new user or a power user, what they just said, what they said three sessions ago. It’s also about what not to know — balancing what’s useful versus what’s intrusive.
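As a rough illustration of context injection, the sketch below appends a small block of user context to a base prompt before the model sees it. The field names, the allowlist and the prompt wording are all hypothetical. The allowlist is one simple way to encode the "what not to know" decision: anything the designer hasn't explicitly chosen never reaches the agent.

```python
from datetime import datetime

# Fields the designer has chosen to let the agent "know". Anything not listed
# here is dropped, even if the backend has it. (Hypothetical field names.)
ALLOWED_FIELDS = ("first_name", "current_task", "minutes_on_task", "is_new_user")


def build_prompt(base_prompt: str, user_context: dict, now: datetime) -> str:
    """Return the base prompt with an allowlisted context block appended."""
    lines = [f"- local_time: {now:%H:%M}"]
    for key in ALLOWED_FIELDS:
        if user_context.get(key) is not None:
            lines.append(f"- {key}: {user_context[key]}")
    return (
        base_prompt
        + "\n\nWhat you know about this user right now:\n"
        + "\n".join(lines)
    )


prompt = build_prompt(
    "You are a helpful voice agent.",
    {
        "first_name": "Sam",
        "current_task": "setting up a budget",
        "minutes_on_task": 20,
        "home_address": "1 Example Street",  # available, but deliberately not injected
    },
    now=datetime(2025, 1, 6, 9, 30),
)
print(prompt)
```

Here the agent learns that Sam has spent twenty minutes on a budget, which lets it open with something more useful than "how can I help you?", while the address stays out of the conversation entirely.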
Casting its voice
You’ll likely get to a stage of reviewing voices from a range of providers (either third parties like ElevenLabs or native voices like the ones OpenAI provides).
What kind of voice best represents your brand and the agent’s role—is it down-to-earth? Kind? Confident? And what does that actually mean in terms of gender, accent, age, class, pitch and pace?
Brand values are really tested when they have to be embodied. You can write “warm but authoritative” in a brief. But when an artist is standing in a booth, you have to know what it actually means—what carries the warmth, where the authority lives, how the two coexist in the voice—it’s the same with voice agents.
Training your ear will help build conviction. Spend time with voice experiences that feel good and ones that don’t, and work out why. It takes practice to hear what’s actually going wrong rather than just feeling it. I’ve written a separate blog post specifically on the architecture of voice conversations, but here are a few things worth listening for:
Prosody: The rhythm, stress, and intonation of speech. It’s what people notice first and can articulate least. When a synthetic voice feels “off,” users will say exactly that, without being able to say why. It’s usually prosody that’s letting it down.
Register: Is the formality level right for the context? A debt collections call needs a different register to a loyalty rewards enquiry.
Confidence: Does the voice elicit reassurance and a feeling of being in safe hands? This is distinct from warmth. It’s about a feeling of competence as well as care.
Voice is a very personal thing, so test, test, test with users.
Who does the user actually want to hear from?
Do they want someone who mirrors themselves? Or who they believe matches the role? Which voice would build trust?
Social psychology can be contentious here. Research consistently shows that voice preferences track cultural stereotypes in ways that may not align with a brand’s values. For example, studies have found that users in financial services contexts still default to preferring male voices when making high-stakes decisions, an expectation shaped by decades of cultural association. Likewise, AI assistants have been criticised for being primarily female-voiced, a default attributed to historical gender stereotypes, user preference for helpfulness, and perceived warmth. Studies suggest people find female voices more trustworthy and soothing, which leads to their adoption in “subservient” service roles, which in turn reinforces the stereotype that women are better suited to assisting tasks. Separately, regarding accents, research found significantly higher ‘truth ratings’ for statements delivered in a native accent than in a foreign accent, with New Zealand English speakers rating a New Zealand-accented AI as more credible than a Scottish-accented one.
I don’t want to say you have to capitulate to ingrained biases, but it’s no use pretending they don’t exist.
BMW once had to recall a whole fleet of vehicles because male drivers were so incensed at the idea of taking directions from a female voice: “it can’t possibly be right!” (Kind of ironic that the ‘helpfulness factor’ seen in AI assistants didn’t seem to cross over into driving.)
Really, I hope the rise of voice agents will fuel deeper research into voice and meaning, as perhaps this gives us the chance to uncover the tenets of voice that embody ‘helpfulness’ or ‘authority’ beyond the gendered default.
But for today, at a time when you’re challenging your user with the strange new experience of a voice agent, is this the time to challenge their subconscious beliefs too?
Where possible, providing options or a mix of personas seems a decent middle ground. While it doesn’t challenge ingrained biases, it does refuse to strengthen them.
Creating the right environment for the perfect performance
Now, you’ve written your character and its ‘script’ system, and you’ve cast its voice. It’s time to hook it up to a mic and put it live.
But before it goes on, are you doing everything you can to help this go well? How are you preparing the user or listener? Are they comfortable with speaking to a voice agent? Do they have a choice? How are you building trust before you’ve even put them in the room with your voice agent? A little introduction on a screen or on a call can make all the difference to how the voice agent is received and whether it earns acceptance.
The moment of truth: how it all comes together in a conversation
This is where you meet your sound engineer.
In the studio, the engineer is the one shaping every breath, pause and stress between the actor and the listener’s ear. With voice agents, that role is played by your voice architecture — the layer that decides what reaches the user and how. The choice made here defines what kind of performance is even possible.
I am simplifying it quite a lot, partly because it’s changing every few months (throughout our Finley build the architecture was being updated based on what was being released), and partly because, while you have many knobs to tweak in the voice model and other parts of the architecture that can incrementally change things, one core choice really comes down to a single experiential trade-off: how much of the emotional signal do you need to keep?
Two architectures dominate today and they sit at opposite ends of that trade-off:
A cascade pipeline (currently the most common) transcribes speech into text, sends that to the LLM, then converts the response back into speech. It’s modular, debuggable, well-supported with tools, and latency has become competitive (~300–500ms with streaming). But the moment speech becomes text, the how it was said is gone. The sigh, the hesitation, the rising pitch that meant “I’m not sure” — the receiving system never hears it. You’re directing a performance from a transcript.
Native speech-to-speech (OpenAI Realtime, Gemini Live) skips the text layer entirely. The model hears the voice and replies in voice. Latency improves, interruption handling feels natural, and some emotional signal survives the round-trip. But you lose modularity, tool-calling is weaker, and you’re more locked into a single vendor’s roadmap.
Other architectures are emerging that try to split the difference. The space is moving fast and perhaps at some point the trade-off will disappear. But to be honest, the point is less about which one you pick. It’s about owning it as a design decision and working with the engineers. A cascade pipeline will give you a confident, controllable performance that may feel slightly flat in emotional moments. A native model will feel more alive, but harder to direct precisely. Neither is wrong. The right answer depends on whether your use case lives or dies on emotional nuance, or on consistency and control.
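If it helps to see why the cascade loses the "how it was said", here is a toy sketch in Python. Every function is a stub standing in for a real speech-to-text, LLM or text-to-speech service, and the Utterance type with its prosody field is an invented stand-in for the audio cues a real system would carry. It is only meant to show where the emotional signal drops out, not how any vendor's pipeline works.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    words: str
    prosody: dict  # stand-in for audio cues, e.g. {"sighed": True, "pitch": "rising"}


def transcribe(utterance: Utterance) -> str:
    """Stub speech-to-text: only the words survive this step."""
    return utterance.words


def generate_reply(text: str) -> str:
    """Stub LLM: it can only react to what is in the transcript."""
    return f"Thanks. You said: {text}"


def synthesise(text: str) -> Utterance:
    """Stub text-to-speech: returns speech with a default, neutral delivery."""
    return Utterance(words=text, prosody={"delivery": "neutral"})


def cascade_turn(heard: Utterance) -> Utterance:
    """One conversational turn through the three cascade stages."""
    transcript = transcribe(heard)
    reply_text = generate_reply(transcript)
    return synthesise(reply_text)


# Same words, very different meaning in the room.
hesitant = Utterance("I guess that's fine", {"sighed": True, "pitch": "rising"})
confident = Utterance("I guess that's fine", {"sighed": False, "pitch": "falling"})

# The cascade cannot tell them apart: both turns produce the same reply.
print(cascade_turn(hesitant) == cascade_turn(confident))
```

A native speech-to-speech model would, in this picture, receive the whole Utterance rather than just its words, which is why some of the emotional signal can survive. The cost is that you no longer have a transcript step to inspect, swap out or debug.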
The studio is everywhere now
Step back into the studio for a moment. The padded walls, the flight deck, the actor in the glass booth, the producer with their notes. That room was where brand voices have been perfected since the invention of the radio a hundred years ago. One performance at a time, every syllable accounted for, the door sealed shut against the world.
The voice agent inverts all of it.
The studio isn’t a room you walk into anymore. It’s the system you build — the character, the script-that-isn’t-a-script, the casting, the architecture, the context, the goals you hand to your agents and the trust you place in them to interpret it. And the performance isn’t a 30-second take you carve up over an afternoon. It’s thousands of conversations happening right now, in cars and kitchens and call queues, none of which you’ll ever hear.
The skill, in the end, isn’t directing the voice. It’s designing the studio so well that the right performance happens whether you’re listening or not.


