Overview: Speech Synthesis and Conversational AI

Over the past year, we have followed a number of innovations in speech synthesis and conversational AI. The state of the art is not yet at the point where it can be dropped straight into a game or VFX pipeline, but the field moves quickly and new techniques emerge regularly. In this blog post we bundle the most interesting developments for you.

Speech synthesis

Speech synthesis is an active research area in continuous evolution. For simple applications, straightforward tools and techniques are readily available. However, cloning your own voice, or that of a voice actor, with a high level of realism still requires an intensive data-collection process. Below is a brief overview of various speech-synthesis services.

Real-Time Voice Cloning

During the preparation year we briefly discussed Real-Time Voice Cloning, a paper from January 2019. The technique makes it possible to clone a voice from a short voice fragment. The resulting audio is intelligible yet mostly robotic, with only a hint of the original voice. There is also little control over the final result: if unwanted artifacts appear, your only options are to add extra embeddings (voice fragments) or to reformulate the sentence.

The implementation can be found on GitHub and was converted to PyTorch by the open-source community in February 2021. Getting the demo working on your own PC still requires some technical knowledge; here you will find the correct installation instructions for Windows and Linux. A video demonstrating the Real-Time Voice Cloning Toolbox and its results can be found here. The author of the paper refers interested readers to Resemble, the company he joined after finishing his thesis.
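
If you want to experiment with it yourself, the snippet below is a minimal sketch of the toolbox's three-stage pipeline (speaker encoder, synthesizer, vocoder), modeled on the repository's demo_cli.py script. The model paths are placeholders, and the exact module layout may differ between versions of the repo.

```python
# Minimal voice-cloning sketch based on the Real-Time Voice Cloning repo.
# Assumes the repo is on PYTHONPATH and the pretrained models are downloaded;
# the paths below are placeholders, not the repo's exact defaults.
from pathlib import Path

import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# 1. Load the three pretrained models (speaker encoder, synthesizer, vocoder).
encoder.load_model(Path("saved_models/encoder.pt"))
synthesizer = Synthesizer(Path("saved_models/synthesizer.pt"))
vocoder.load_model(Path("saved_models/vocoder.pt"))

# 2. Derive a speaker embedding from a short reference fragment (a few seconds).
reference_wav = encoder.preprocess_wav(Path("reference_voice.wav"))
embedding = encoder.embed_utterance(reference_wav)

# 3. Synthesize a mel spectrogram for new text, conditioned on that embedding.
texts = ["This is a cloned voice speaking a brand new sentence."]
specs = synthesizer.synthesize_spectrograms(texts, [embedding])

# 4. Turn the spectrogram back into a waveform with the neural vocoder,
#    padding with a second of silence like the repo's demo does.
generated = vocoder.infer_waveform(specs[0])
generated = np.pad(generated, (0, synthesizer.sample_rate), mode="constant")

sf.write("cloned_output.wav", generated.astype(np.float32), synthesizer.sample_rate)
```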

Resemble

Resemble offers the same functionality as the open-source repository above, but in a user-friendly package with additional services, such as AI-generated text, depending on the subscription model you choose. The results are considerably better than those of the now two-year-old repo, but they still won’t fool anyone. Halfway down the main page you can listen to a Dutch text fragment. These voices will not replace voice actors for the time being, but they may be interesting as placeholders or for internal demos.

Sonantic

This service lets you convert your scripts to audio, this time with a strong focus on control over the emotional tone, and the voices sound a lot more human. At the bottom of the website you will find a demo you can experiment with. The limitation is that you cannot clone or add voices yourself; the advantage of this curated library is the high voice quality and the expressive power of the emotional controls. This is certainly interesting for studios where a large amount of dialogue is subject to change.

15.ai

This project and its code have not yet been released at the time of writing. The related paper will most likely be published under the name “Natural realistic emotive high-fidelity faster-than-real-time text-to-speech synthesis with minimal viable data.”

The name of the website and service, 15.ai, has no clear meaning, although it is sometimes joked that it refers to the number of days per year the service is actually available. The developer of this free online toolbox wants nothing less than perfection and only puts the site online when he considers the quality sufficient. With a bit of luck you will find a beta version online under the subdomain https://final5.15.ai or an increment thereof (final6, final7, …).

The voices in this toolbox are limited to fictional characters. Despite the limited training data, they are of high quality. Here too, there is the option to attach emotions to voice fragments.

This makes 15.ai very popular among fan communities, from Team Fortress 2 to My Little Pony enthusiasts, who use it to bring their offbeat dialogues and stories to life, with or without self-made animations. This is the most viewed 15.ai video on YouTube, and you will find even more examples under “15.AI TF2”. Please note: these fan communities have quite specific humor and often use obscene terms in their creations.

If the service is not online, here you will find an overview of the interface of the most recent version, which appeared a few days ago. A similar service, https://uberduck.ai/, has also recently appeared, albeit with a more limited offering and voices of much lower quality; a Google or Discord account is required if you want to test it yourself.

Be sure to keep an eye on the 15.ai URL: the paper being written around it is very promising. The ability to generate emotional speech with a minimum of data is unprecedented and one of the most difficult challenges within the field of speech synthesis.


Coqui.ai

While the previous services are available from the browser or as consumer software, the coqui.ai toolkit targets a slightly more technical audience. The original project, Mozilla TTS, was continued by its author as an independent open-source project under a new name.

What is unique about Coqui TTS is that you can train any type of voice, in any language, with reasonable realism. The challenge is setting up the project and finding or compiling a high-quality dataset yourself.
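
To make “compiling a dataset” concrete: Coqui TTS ships loaders for common dataset layouts, including the LJSpeech format, where a metadata.csv file maps each audio clip to its transcript. The sketch below is a hypothetical example of that layout; the folder name, clip ids, and sentences are purely illustrative.

```python
# Hypothetical sketch of an LJSpeech-style dataset for Coqui TTS.
# Layout: my_voice/wavs/*.wav plus my_voice/metadata.csv, where each line is
#   <clip id>|<raw transcript>|<normalized transcript>
from pathlib import Path

clips = [
    ("utt_0001", "Dit is zin één.", "dit is zin een."),
    ("utt_0002", "Spraaksynthese vraagt om schone opnames.",
     "spraaksynthese vraagt om schone opnames."),
]

dataset_dir = Path("my_voice")
(dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)

with open(dataset_dir / "metadata.csv", "w", encoding="utf-8") as f:
    for clip_id, raw, normalized in clips:
        # Each clip id must match a file wavs/<clip id>.wav
        # (mono recordings at a consistent sample rate).
        f.write(f"{clip_id}|{raw}|{normalized}\n")
```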

Within DAE Research I trained the following Dutch voice with Tacotron, using a self-compiled dataset. If you are technically minded you can use the same model with the following commit, although I recommend training your own model with the up-to-date Coqui repository; this will give the best results.
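
Before investing days of GPU time in training, it is worth sanity-checking your installation with one of the published pretrained models. The sketch below drives the toolkit's documented tts command-line tool from Python; the model name is just one of the available English models (tts --list_models shows the full list), and flag names may differ slightly between versions, so check tts --help.

```python
# Sanity check: synthesize with a pretrained model through Coqui's `tts` CLI.
# The model name is an example from the published model list.
import subprocess

subprocess.run(
    [
        "tts",
        "--text", "Setting up the toolkit is the hard part; synthesis is one call.",
        "--model_name", "tts_models/en/ljspeech/tacotron2-DDC",
        "--out_path", "demo_output.wav",
    ],
    check=True,  # raise if the CLI reports an error
)
```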

Conversational AI

In 2019, an internal research project investigated the possibilities of conversational AI within the context of Howest. It concluded that setting up a complete chatbot, in that case for a school helpdesk, was still too time-consuming to be practically feasible. In addition, existing techniques struggled with references to previous sentences, let alone to an earlier conversation with the same person. While this remains a major challenge to this day, there have been interesting developments in generative AI since then, most notably GPT.

GPT-3

GPT-3, or Generative Pre-trained Transformer 3, from OpenAI is an impressive language model: not only in terms of its linguistic capabilities, which can be measured in various ways, but also in the computing power (training cost estimated around 12 million dollars) and the mountain of data required to train it. With no fewer than 175 billion parameters in its largest variant, GPT-3 was roughly ten times larger than the previous record holder, the 17-billion-parameter Turing-NLG that Microsoft presented in February 2020.

Numerous examples of its capabilities can be found on the OpenAI website. This video is a nice example of the chat capabilities of the model. Please note that the avatar and speech were added afterwards; the conversation was not real-time.


Unfortunately, GPT-3 is still not openly available. OpenAI has operated as a for-profit organization since 2019, and Microsoft has since invested more than $1 billion in the company. It is therefore no surprise that only they have access to the source code and model weights, while other parties have to make do with the public API. This language model, combined with training data from millions of code repositories on GitHub, led to the creation of GitHub Copilot, available as an extension for Visual Studio Code. The service lets developers automatically generate code based on their comments.
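
For those who do get access, using the API is straightforward. Below is a minimal sketch using the openai Python package's Completion endpoint as documented at the time of writing; the engine name, prompt, and parameters are illustrative choices, so check the API documentation for current options.

```python
# Text completion through the OpenAI API (requires an approved API key).
# The engine name "davinci" is an example; other engines are available.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    engine="davinci",
    prompt=(
        "The following is a conversation with a helpful game NPC.\n"
        "Player: Where can I find the blacksmith?\n"
        "NPC:"
    ),
    max_tokens=60,
    temperature=0.7,      # higher values give more varied replies
    stop=["Player:"],     # stop before the model writes the player's next line
)
print(response.choices[0].text.strip())
```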

GPT-NEO / GPT-J-6B

These are the open-access variants of GPT-3, trained by EleutherAI. You can enter a sentence (prompt) in English online, which the model then completes for you in a creative way. The ‘small’ Neo version, with 2.7 billion parameters, is available on Hugging Face. You can also test the capabilities of the largest variant, with 6 billion parameters, at https://6b.eleuther.ai/, although the latter may take a while before the output appears on your screen. Nevertheless, the results can be quite interesting. You can also download The Pile, the 825 GB text dataset on which these models were trained, provided you have the storage space for it.
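
Running the smaller Neo variant locally takes only a few lines with the Hugging Face transformers library, as sketched below. Keep in mind that the 2.7B model downloads roughly 10 GB of weights on first use, and the sampling parameters are just a starting point.

```python
# Local text generation with GPT-Neo 2.7B via Hugging Face transformers.
# Swap in "EleutherAI/gpt-neo-1.3B" for a lighter first test.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")

prompt = "The old lighthouse keeper looked at the horizon and said,"
outputs = generator(
    prompt,
    max_length=80,    # total length (prompt + completion) in tokens
    do_sample=True,   # sample instead of greedy decoding, for creativity
    temperature=0.9,
)
print(outputs[0]["generated_text"])
```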

Other language models to keep an eye on

There really are a lot of language models, each with their own capabilities. Here is a short list of the most important ones at the moment.

LaMDA

LaMDA stands for Language Model for Dialogue Applications. This conversational model was showcased in May at the annual Google I/O conference as the next generation of chatbots. Although the results look promising, I wouldn’t hold my breath just yet. Remember the hype surrounding Google Duplex? Three years later, that technology is still not fully available in the US, let alone the rest of the world.

RobBERT

So far, the language models we have discussed have mainly been useful in English. A language model that we have not discussed here, Google’s BERT, has been trained on Dutch by KU Leuven under the name RobBERT. Unlike GPT, RobBERT cannot generate text; it is geared toward language-understanding tasks such as classification and predicting masked words. The code and an overview of RobBERT’s capabilities can be found in the GitHub repository.
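
As an illustration, a masked-word prediction with RobBERT through the transformers library might look like the sketch below; the model identifier is the one published on the Hugging Face hub at the time of writing, so check the GitHub repository for the current name.

```python
# Dutch masked-word prediction with RobBERT (a RoBERTa-style model).
# Model id as published on the Hugging Face hub; verify against the repo.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")

# RoBERTa-style models use <mask> as the placeholder token.
for prediction in fill_mask("Er staat een <mask> in mijn tuin."):
    print(f"{prediction['token_str']:>12}  (score {prediction['score']:.3f})")
```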

Switch Transformers and Wu Dao 2.0

Although GPT-3’s tenfold jump to 175 billion parameters was impressive in 2020, it is no longer the largest language model on the market. In early 2021, Google presented Switch Transformers, a sparsely activated model of 1.6 trillion parameters. Switch Transformers received little media attention because it came without a fancy AI demo, but a consumer-product revolution may still be on the way.

On June 1, the Beijing Academy of Artificial Intelligence released Wu Dao 2.0, a model with 1.75 trillion parameters, which in turn surpasses GPT-3 by a factor of 10 and is currently the largest language model. The model can hold conversations, write poems and music, accept images as input, and generate recipes. Not much is known about Wu Dao 2.0 yet, although China plans to use it to train their first virtual student. Possibly in the hope of moving toward AGI?

Fable

Although I have been somewhat skeptical about the possibilities of conversational AI, the Fable demo has changed my mind. It is no doubt the result of several years of intensive work, most likely with a lot of venture capital behind it, but the capabilities are promising. With this video, Fable also wants to introduce the concept of “Virtual Beings” that could become part of our lives. For those interested, there is an annual Virtual Beings Summit, which takes place this year on 14/07. Interesting projects may emerge from it in the coming days.