The Cologne-based translation company best known for its text tools has unveiled a full voice product suite covering meetings, conversations, group settings, and an API for enterprise integration. A live demo in Seoul showed one-to-two sentence delays, and DeepL’s CPO acknowledged word order differences between languages remain a fundamental challenge.
DeepL, the Cologne-based language AI company that built its reputation on high-quality text translation, has launched DeepL Voice-to-Voice: a real-time spoken translation suite designed for live business communication.
The product covers four distinct use cases: virtual meetings, mobile and web conversations, group settings for frontline workers, and enterprise applications through an API. It supports more than 40 languages, including all 24 official EU languages and additions such as Vietnamese, Thai, Arabic, Norwegian, Hebrew, Bengali, and Tagalog.
The suite’s four components are at different stages of availability. Voice for Conversations, which enables real-time translation across mobile and web without requiring app installation, is now generally available.
Voice for Meetings, which integrates with Microsoft Teams and Zoom so participants can speak in their native language while others hear simultaneous translation in theirs, is opening an early access programme in June.
The Voice-to-Voice API, which lets businesses embed DeepL’s translation engine into their own customer-facing applications such as call centres, is in ongoing early access. A customisation feature, Spoken Terms, which allows the system to learn industry-specific vocabulary, company names, and personal names, is scheduled to become generally available on 7 May.
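DeepL has not published a programmatic interface for Spoken Terms, so the following is an illustration by analogy only: DeepL's existing text-translation REST API (v2) supports custom terminology by attaching a `glossary_id` to each `/v2/translate` request. Whether the voice product reuses this mechanism is an assumption; the sketch below simply builds such a request payload without sending it.

```python
import json


def build_translate_request(text, source_lang, target_lang, glossary_id=None):
    """Build a JSON payload in the shape of DeepL's text API v2
    POST /v2/translate endpoint. The glossary_id field is how custom
    terminology is applied in the text API; its use here as a stand-in
    for Spoken Terms is an assumption, not documented behaviour."""
    payload = {
        "text": [text],            # the API accepts a list of strings
        "source_lang": source_lang,  # required when a glossary is used
        "target_lang": target_lang,
    }
    if glossary_id is not None:
        payload["glossary_id"] = glossary_id
    return json.dumps(payload)


# Example: request a DE translation using a previously created glossary
# ("my-glossary-id" is a hypothetical placeholder).
req = build_translate_request(
    "The patient presented with dyspnoea.", "EN", "DE",
    glossary_id="my-glossary-id",
)
```

In the text API, glossaries are created once (via a separate endpoint) and then referenced per request; a voice product with the same design would let a company register its industry vocabulary ahead of time rather than per call.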
Jarek Kutylowski, DeepL’s founder and CEO, described the launch as reaching “another frontier in translation.”
“DeepL Voice-to-Voice allows everyone to speak naturally in their own language without the friction or cost of interpreters,” he said.
DeepL has positioned the product as an enterprise tool rather than a consumer one: the company said its voice technology never uses customer data to train its models and does not permanently store transcription or translation data after a call ends. That security framing distinguishes it from consumer AI voice products and is aimed at regulated industries.
The current system works through a three-step pipeline: speech is converted to text, the text is translated using DeepL’s established translation engine, and the output is then converted back to speech.
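The three-step pipeline described above can be sketched as a simple composition of stages. This is an illustrative sketch only, not DeepL's implementation: all three functions are hypothetical stand-ins, and the `translate` stub merely marks where DeepL's text engine would sit in the chain.

```python
# Illustrative sketch of a cascaded speech-to-speech translation pipeline.
# None of these functions are DeepL's real API; each is a hypothetical
# stand-in showing how the three stages compose.

def speech_to_text(audio: bytes) -> str:
    """Stage 1: transcribe source-language audio (stand-in ASR)."""
    return audio.decode("utf-8")  # pretend the audio is raw text


def translate(text: str, target_lang: str) -> str:
    """Stage 2: where the established text-translation engine would
    be invoked; here a trivial stub that tags the target language."""
    return f"[{target_lang}] {text}"


def text_to_speech(text: str) -> bytes:
    """Stage 3: synthesize the translated text (the article notes the
    current system uses a fixed synthetic voice)."""
    return text.encode("utf-8")


def voice_to_voice(audio: bytes, target_lang: str) -> bytes:
    """Chain the three stages: ASR -> MT -> TTS."""
    transcript = speech_to_text(audio)
    translated = translate(transcript, target_lang)
    return text_to_speech(translated)


out = voice_to_voice(b"Guten Tag", "EN")
```

A cascaded design like this explains both the quality argument and the latency problem: the middle stage inherits the strengths of the text engine, but it cannot translate a verb-final sentence until enough of it has been spoken.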
DeepL’s competitive argument rests on the quality of the middle step: the company says its text translation models outperform alternatives, and that advantage propagates through to the voice output.
In blind evaluations commissioned by DeepL and conducted independently by Slator, a language industry research firm, 96% of professional linguists preferred DeepL Voice over the native translation features in Google Meet, Microsoft Teams, and Zoom, citing superior fluency and contextual accuracy. DeepL Voice scored 96.4 out of 100 in the Zoom comparison and 96.3 in the Microsoft Teams comparison.
However, a live demonstration by Chief Product Officer Gonzalo Gaiolas at the company’s DeepL Connect Seoul event, held on 15 April, exposed the system’s current limitation: a visible delay of one to two sentences between the speaker finishing and the translation being delivered.
Gaiolas acknowledged the lag directly. “Different languages have different word orders and sentence structures, which causes delays in real-time interpretation,” he said, according to Seoul Economic Daily.
The company plans to reduce latency through continued model development. On the voice quality side, the current system translates using a fixed synthetic voice; DeepL said it plans to release a voice-preservation feature, which maintains the speaker’s original voice characteristics in the translated output, by the end of 2026.
DeepL is entering a market with multiple well-funded competitors. Sanas, which uses AI to modify speakers’ accents in real time for call centre applications, raised $65 million in a round led by Quadrille Capital.
Dubai-based Camb.AI focuses on speech synthesis and translation for media dubbing. Palabra, backed by Reddit co-founder Alexis Ohanian’s Seven Seven Six, is developing a real-time speech translation engine focused on preserving speaker voice characteristics.
Google, Microsoft, and Zoom all offer their own meeting translation features, which makes them the platforms DeepL is simultaneously challenging and integrating with. DeepL's strategic bet is that translation quality, its longest-established differentiator, can outweigh the structural advantages incumbents hold in platform distribution.


