First long-form speech synthesis platform for publishers and creators

The first high quality long-form speech generation platform

By ElevenLabs Team in Product — Oct 17, 2022

This November, we're launching the first speech synthesis platform that allows publishers and creators to generate high quality, emotionally compelling long-form content.

Who’s it for?

We chose this direction for a number of reasons. There is currently no tool which supports generating long-form speech in high enough quality to make it suitable for voicing news or audiobooks. Our team are keen listeners of all things audio and we felt that rising to the challenges posed by lengthier content is a natural step towards realising our ambitions. But we’re also particularly excited to consider it our stand-out feature - we’re the first AI speech tech platform to bring the most emotive, rich and lifelike voices to creators and publishers seeking the ultimate storytelling quality.

To this extent, our platform allows you to generate and download high quality, voice actor-grade speech from any text - be it news articles, books, newsletters, blogs or academic papers. You can choose any voice to read content - either from a set of pre-defined synthetic voices, or by cloning a voice from a sample you provide. The uses we imagine for our tech are endless. From providing existing content with cross-medium accessibility, through raising productivity, to reviving texts from the past by converting them to audio, or creating new content. Our next objective is extending support to other languages.

How’s Eleven different?

How we achieve this is down to the way we’ve built our model. It’s trained to understand what is being said and to adjust delivery accordingly. It does this by taking into account not just the meaning of words but also the context surrounding each utterance.

Traditional speech generation algorithms produce utterances on a sentence-by-sentence basis. This is computationally less demanding but immediately comes across as robotic. Emotions and intonation often need to stretch and resonate across a number of sentences to tie a particular train of thought together. Tone and pacing convey intent which is really what makes speech sound human in the first place. So rather than generate each utterance separately, our model takes the surrounding context into account, maintaining appropriate flow and prosody across the entire generated material. This emotional depth, coupled with prime audio quality, provides users with the most genuine and compelling narrating tool out there.

Hear the difference - Eleven vs Microsoft Azure:

Microsoft Azure Text-to-Speech

J.K. Rowling, Harry Potter and the Philosopher's Stone, Fragment 1

0:00

/40.599542

J.K. Rowling, Harry Potter and the Philosopher's Stone, Fragment 2

0:00

/37.06775

Eleven Labs Speech Generation

J.K. Rowling, Harry Potter and the Philosopher's Stone, Fragment 1

0:00

/36.996458

J.K. Rowling, Harry Potter and the Philosopher's Stone, Fragment 2

0:00

/28.816375

Become our beta-tester

Our platform will go live next month and you can register to become our beta-tester today at elevenlabs.io

audiostory.ai

If you’re curious to hear our software at work, go to audiostory.ai - a side-project by Eleven Labs aimed at showcasing our long-format speech generation capabilities where we use our synthetic voices to read news articles and books from the past. The first episode is an 1899 article from The New York Times on the invention of radio - listen to it here. Or, if you haven’t already, you can go to the top of this page and listen to this entry read aloud.

Who’s it for?

How’s Eleven different?

Become our beta-tester

audiostory.ai

Try ElevenLabs today