I gave a talk at ENJMIN 2020 about the gibberish-based procedural voiceover system we built for the 2016 game Reigns.
Transcript:
My talk is called "Reigns: A Thousand Years of Deterministic Babble". This talk is about a voiceover system that we created. We're going to talk about what worked, what didn't work, and some of the things that I might have done differently.
So what is Reigns exactly? Reigns is a game where you rule a kingdom. Your advisors and those within your principality come to you making suggestions and things. And you have this interface that's almost like a dating game such as Tinder, where you swipe left or right to make a choice. All of your choices affect the state of things. And that's generally how it works. There's lots of different characters, different personalities.
This was one of my first opportunities to experiment with spoken dialogue in a game. I was the audio director on this project, and in the past I've worked on games where I've done all of the content creation. But this was the first time where I wanted to take a step back as far as content creation goes, and focus more on the overall vision for the game's audio. Given that, I knew that I wanted to have the characters speak in an unfamiliar language - something that would add character to the experience. I was definitely thinking about games like "Animal Crossing" and "The Sims", and I wanted to come up with a unique and systemic approach to see if we could give it a little bit of a sense of intelligence or immersion. And in thinking about this type of approach to voiceover, and how we could maybe push it a little bit further, my mind naturally went towards popular examples of creative languages, especially from television, movies and literature. I thought about Klingon from Star Trek, Elvish from The Lord of the Rings, or Dothraki from Game of Thrones.
These types of examples are called conlangs, short for constructed languages. These languages are fully functional and quite impressive. But that type of approach seemed like overkill for a fun and light game on a short timeline - a game that was developed in under a year. Building a conlang is hard work and I'm not exactly a linguist, so I wanted to do something a little bit more streamlined, like making a bunch of random sounds with your mouth. I mean, that's pretty easy, right? That seemed like a good place to start.
Of course, if you have something that has absolutely no structure at all, it's just noise. And so we tried to come up with some ways to remedy that.
I think it works pretty well. I mean, I would say that, you know, this sort of gibberish, it lacks the depth that you would get from a conlang, but it still manages to personify the game and the characters. What we really tried to set out to do, is to take something that's kind of chaotic, where we're putting in all these different inputs and find a way to have some sort of a controlled chaos ... something that has a little bit of structure to it, to just make it work a little bit better. On our project, we had a three person audio team. I was audio director and we had a dedicated sound designer and composer. We all worked together in developing the system. I sort of led the charge, but we all pitched in as far as recording voices and things like that.
Recording Process
Let's talk a little bit about the recording process and our approach for how we structured the voiceover. We designed about 20 unique voices for the game and the voice actors were ourselves and some of our friends. It was fun and really inspiring how quickly the voices would come together. We settled on a novel approach where we would start with a seed phrase, something like this one: "money banana stand". And we'd ask the voice actor to riff on this phrase and to embody a persona of one of the characters in the game. In this case, we linked this phrase with this character, "The Jester". We'd ask the actor to riff in a stream of consciousness way, keeping in mind the sounds associated with this phrase. It was really fun to do this. Having a seed phrase like this, I found that it made the performances more focused, more intentional. The gibberish felt like a unit, almost sounding like a language of sorts.
Here's another example: "quantization prerogative". Don't ask me what that means, I have no idea. That's just a seed phrase we came up with for this character who's a nefarious magician in the dark arts. That one's actually me, and the other one was the composer. And so we would take these performances, the ones that we liked, and we'd chop them up into sets of anywhere from 30 to 90 assets in some cases. We would delete the fragments that stood out in a bad way or were redundant - sometimes, you'd have repeats of certain things. The assets themselves were usually one or two syllables, sometimes three or four. We found that having a blend of those really helped the system to work as well as it does. When it's all single syllable sounds, connecting them together becomes a lot more difficult in the sense that you lose some of the human element. It starts to sound a little more robotic. It starts to sound a little bit more like "Animal Crossing", which is not bad. It's actually really cool. It's just a different style. And when you have multisyllabic fragments, and with the way that the human voice connects sounds together, you hear a bit more of an emotional element in the speech.
These were things we kept in mind. Also, cutting on hard syllables seemed to work really well. So something like "Brero" or "Di", in having those hard transients, it really helped with connecting fragments together that may have been pulled from different sections of the recording. Even vowels could be used as hard syllables in some cases, depending on the performance.
Implementation Process
Ok, so let's talk about the implementation process now. We have these recordings of different voices, and we've chopped them up into these little fragments. Now we have to figure out how to actually put them together in a way that sounds reasonable.
Let's take this character as an example, "Lady Gray". Here's a card with some text from the game. Thinking about how to implement this, we started thinking about how we can differentiate the characters. What can we do and what will some of the parameters be that will help us to do that? And so this is what we came up with, it's a short list of parameters, but they all give us a certain amount of variability and control. So it's simple, but enough levers to try to separate the characters from each other. So we have voice type, for which set of recorded fragments we're going to use for this character. Pitch, for whether we want to make this character's voice a little slower and deeper, or maybe a little higher and thinner. This was just a nice element to have after recording the voices. In some cases we wanted to tweak them a bit. We had about 20 voice sets that we created, but there were more characters than that. So we ended up using a single voice set for multiple characters in some cases. And so pitch was a nice way to differentiate those usages.
Resonant frequency was a parametric EQ band, essentially picking a frequency to boost or drop, in order to adjust the timbre of a voice. And then fragment overlap size, which is basically about figuring out how much distance to put between our voice fragments. Sometimes we'd use a negative distance to get the fragments closer together. As far as a global parameter goes, we had to figure out, "OK, how long do we want these performances to be?" So we had to come up with a text to speech ratio for duration based on how much text there is. How much speech should there be and how long should it be? It was important to maintain a certain flow for the gameplay. People could be swiping through cards relatively quickly. And so we didn't want the audio experience to be getting cut off constantly by the users' natural way of playing the game. The speech should never get too long. It should never be more than a couple of seconds. But if this text is short, then the speech should be shorter to reflect that. So the question was, how long should this performance be?
What's going to feel natural? And the answer was pretty obvious. It's n/55. OK, maybe, this isn't that obvious, but this is what worked. And what we're talking about here is length in time of the speech. So length = n/55. So what's n? N is the character count. So it's the number of characters in this card. And 55, what is 55? It's nothing. It's an arbitrary number ... trial and error is what yielded this formula, where length equals character count divided by fifty five. And so for this example, the character count is eighty seven. There's eighty seven characters in the text of this card and this formula works out to about 1.58 seconds. Given the card text limit, the duration will never get too long. So this, this works out well.
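A minimal Python sketch of that formula, for illustration only. The function and constant names are assumptions and are not taken from the game's code.

```python
# Tuning constant from the talk: characters of card text per second of speech.
CHARS_PER_SECOND = 55


def target_speech_length(card_text: str) -> float:
    """Return the desired speech duration in seconds for a card's text."""
    return len(card_text) / CHARS_PER_SECOND


# An 87-character card, as in the example above.
print(round(target_speech_length("x" * 87), 2))  # 1.58
```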
Now that we have our formula, the way the system works is we take the card text and use it as a seed. We use the seed to deterministically generate random values. And then we use those random values to select which voice fragments to play. And we'll do this as many times as it takes to reach the desired speech length, which in this case was 1.58 seconds.
In this example, we have three fragments, "anats", "bnanda" and "UsTAH!". This is just a made up example. But between those and the fragment overlap amount for this character, which is a negative overlap that brings them closer together, in total that puts us past the desired length. This boundary that we're using is actually a soft boundary.
Going over a bit on time is OK because we're in the ballpark of what we want. And we'll always play at least one fragment, which is important because if it's a really short piece of text, you don't want it to play nothing at all. Because it's seeded by the text, the cards always trigger the same speech fragments every time, which is neat. It makes playback reproducible and it makes it easier to test. One of the hypotheses was "maybe this feels a little less random?". It's hard to say for sure, but that was the intent. As far as making the performances feel more natural, having the overlapping fragments really allowed us to dial in speaking styles. Maybe for certain sets of voices, the speech is slower and we want to adjust the overlap accordingly to match the personality of the character.
The last thing that we did was to always put the longest fragment that we chose at the end. We found that this just sounded better. It sounded a little bit more natural. And I think a lot of that has to do with the way that we chopped up these performances. A single syllable that gets chopped up tends to be at the beginning or in the middle of a phrase. If it's a multisyllabic fragment and therefore a longer length, it tends to have been taken from the end of a spoken phrase. And so those just naturally seem to sound better at the end of these stitched together performances. That's basically how the system works.
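The steps described above can be sketched roughly as follows. This is a hedged illustration, not the game's implementation: the class and function names, the fragment durations, the default parameter values, the SHA-256 seeding, and the `max_fragments` safety cap are all assumptions. Pitch and resonant frequency are carried as data only and are not applied to any audio here.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str        # label for a chopped-up recording
    duration: float  # length in seconds (made-up values in the usage below)


@dataclass(frozen=True)
class CharacterVoice:
    fragments: tuple             # voice type: which set of fragments to use
    pitch: float = 1.0           # data only in this sketch
    resonant_hz: float = 1000.0  # EQ band frequency, data only in this sketch
    overlap: float = -0.05       # seconds between fragments; negative pulls them together


def rng_from_text(card_text: str) -> random.Random:
    # Python's built-in hash() of a str is randomized per process by default,
    # so derive the seed from a stable digest instead (an assumed choice).
    digest = hashlib.sha256(card_text.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def build_speech(card_text: str, voice: CharacterVoice,
                 chars_per_second: float = 55.0, max_fragments: int = 32) -> list:
    target = len(card_text) / chars_per_second
    rng = rng_from_text(card_text)
    chosen, total = [], 0.0
    # Always pick at least one fragment; stop once the soft boundary is passed.
    while len(chosen) < max_fragments:
        frag = rng.choice(voice.fragments)
        total += frag.duration + (voice.overlap if chosen else 0.0)
        chosen.append(frag)
        if total >= target:
            break
    # Move the longest chosen fragment to the end.
    longest = max(chosen, key=lambda f: f.duration)
    chosen.remove(longest)
    chosen.append(longest)
    return chosen


jester = CharacterVoice(fragments=(
    Fragment("anats", 0.45), Fragment("bnanda", 0.60), Fragment("UsTAH!", 0.75)))
print([f.name for f in build_speech("x" * 87, jester)])
```

Because the random generator is seeded from the card text, calling `build_speech` twice with the same text and voice returns the same fragment sequence.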
Postmortem
I want to talk a little bit about what went right and what went wrong with this system and some future ideas about how to improve on it. I think overall, we all felt pretty good about it. Stitching variable size fragments together works well.
As far as some categories that we can talk about, we can start with language. I think we could have done a bit more to tie the gibberish together. What if everyone spoke the same strain of gibberish? That might have felt more intelligent as opposed to having all these different strains. Although, to be fair that does differentiate the characters a bit. I think that might have been sort of a trade off there, but definitely something worth thinking about.
And then there's performance. I think we could have hired more actors, we could have recorded more voices, of course, and maybe taken it a bit more seriously. We were having a lot of fun with it, but I think if we had honed in a little bit on the direction we might have been able to improve the results somewhat. We just went about it in a loose way. I would record a couple of voices, the sound designer would record a couple of voices on his end ... there wasn't a super unified type of approach.
The thing that really drove the system, the deterministic method that we came up with, I think we felt was interesting. It's good for testing because it's reproducible, but honestly, it's not that noticeable for people. And I think because of that, it kind of undercuts the design, because the intention was for it to really lend this sense of embodiment and intelligence to the speech. It doesn't quite hit that mark, I don't think.
Overall, I would say it's successful as a proof of concept. But between all three categories, I think we could have done a bit more to create a sense of immersion and intelligence with the system. I think if I could do it again, I would try to find a more effective deterministic method, something that would add another layer as far as differentiating the characters, making it feel more like a shared culture between everyone since they're all living in and around your kingdom.
Syllable Based Seeding
Perhaps a system that seeds based on syllables instead of paragraph length would have given us a better way to do these sorts of things. I actually built such a system that went unused for another project, but it worked really well. The idea was that you would map syllables of text directly to the syllable recordings, using a similar deterministic method. What that would do is give the invented language a more coherent sound, and your experience of it would feel more intelligent. If you had a repeated segment of speech like, "ho ho" or "no, no, no!" the system would reflect that with repetition. That made it feel even more like a real language. Or you might pick up on themes of conversation if multiple characters are talking about some particular theme and they keep using certain words. You might actually be able to pick up on those things. In a subtle way over time, you might start to internalize a sense of this language, even though it's completely fabricated and doesn't really have any function.
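One way such a mapping could look, as a rough Python sketch. The unused system itself isn't described in detail here, so everything below is an assumption: the naive vowel-group syllable splitter, the SHA-256 hashing, and all names are hypothetical.

```python
import hashlib
import re

# Naive rule: optional consonants, then a vowel group, plus any trailing
# consonants at the end of the word. Not a real syllabifier.
_SYLLABLE = re.compile(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$)?")


def split_syllables(text: str) -> list:
    """Very rough syllable splitter: one chunk per vowel group in each word."""
    syllables = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Words with no vowels (e.g. "hmm") are kept whole.
        syllables.extend(_SYLLABLE.findall(word) or [word])
    return syllables


def fragment_for(syllable: str, fragments: list) -> str:
    # A stable digest means the same syllable always lands on the same fragment.
    digest = hashlib.sha256(syllable.encode("utf-8")).digest()
    return fragments[int.from_bytes(digest[:8], "big") % len(fragments)]


def speak(text: str, fragments: list) -> list:
    """Map each syllable of the text to one recorded fragment."""
    return [fragment_for(s, fragments) for s in split_syllables(text)]


voice = ["anats", "bnanda", "UsTAH!", "di", "brero"]
print(speak("no, no, no!", voice))  # the same fragment three times
```

With this kind of mapping, repeated words in the card text come out as repeated sounds, which is the effect described above. Different syllables can still collide on the same fragment when the voice set is small.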
Machine Learning
There have been amazing advances in speech synthesis. Just recently there was a paper showing that your voice can be almost completely replicated with only five seconds of your recorded speech, which is scary and amazing. I think something along these lines could lend itself to a really interesting game implementation for procedurally generating speech. And I'm really excited to see something like that happen in the future. I think we've barely touched the surface of what's possible for inventive language and games. It's a really interesting space. Most of the gibberish voiceover that I've seen to this point in games has been largely based on aesthetic. And I think it would be interesting to see more investigation into ways of making these inventive languages have a sense of function, even if they don't really have any function. I'm excited to see what we come up with.
Q: Did chasing the fun help?
I think so. I think the fun is what really drove the process, it was the idea of doing something new personally for myself, something I'd never done before, but also seeing how we can take some of the touchstones for this kind of thing, like "The Sims" and "Animal Crossing" and do something different. Certainly, it can be a technical rabbit hole. You can get really deep with it. In some of my other experimentations I've found it can be really challenging to get good results. For instance, systems that are based on single syllables. That is a really hard thing to do, to combine single syllable sounds into speech that sounds pleasing and doesn't just sound like a robot. It's really hard without advanced technology. I think ultimately it was more of an artful process for me than an intellectual one. And in that manner, at times it was about picking the path of least resistance, picking the thing that was going to give us the most bang for our buck and not get us stuck in a technological trap.
Q: Did you consider applying these concepts to Hyper Light Drifter?
No, I don't think we thought about that. I'm sure it could have worked, but there was something nice about just having those glyphs, and it's also really challenging, I think, from a design perspective. Occasionally you'd also have these storyboards that were meant to convey information too ... that to me felt like a different thing. It's possible that we could have done something with sound. But I think it accomplished what it was trying to do without it.
Q: Can you talk more about your role as Audio Director on Reigns?
I was the first audio person brought onto the project, and so in that way it kind of fell upon me to make some suggestions about how to move forward. I think originally the expectation and the intention was that I would do the music. But as I got into it, I realized that I was a little bit more interested in the systems as opposed to creating content, and that I would open it up to some other people to get involved. I would just focus on supporting them and focusing on the systems. And so that's really where I started. I was interested in this voiceover system, and I was also interested in a music system that we built for this game, which is also a type of phrase based system. It used four part voice leading ... soprano, alto, tenor, bass. And those four parts were mapped to the four categories that you're trying to manage in the game, religion, the military, the people and the treasury. So each one of those voices is mapped to those in some fairly straightforward senses like driving part volume. But the system also had other challenges, such as figuring out how to transition through the many different phases of the game.
Q: Does localized text generate different voiceover?
I'm actually not sure! That's something we should look into. It'd be interesting to see if it's different. That would be appropriate, I'd say.
Q: What did you learn as audio director?
It was a really good experience for me to be in that role, because it helped me to learn that I really don't like managing people at all. But I also learned that I really like getting involved with building systems. I find it really interesting and I've already done a lot of systems work since then. I've been working with Dino Polo Club out of New Zealand for the last six odd years, working on Mini Metro and Mini Motorways, and have done lots of interesting sound work for those games. On Solar Ash, I've been doing a lot of systems work as well. I've always had a bit of an itch for the technical problem solving and coming up with novel approaches to things. Between that and learning I dislike working in a managerial role, it was a great learning experience. Actually at Heart Machine, that was a role that was mine in theory, if I wanted it. But we ended up bringing in somebody else to take on that role, because I didn't want to do it.
Q: How would you recommend applying this technique to an in-game radio channel?
It's hard to say. A radio program is a totally different form factor. When you don't have the context that you get from visuals, then you're really asking a lot from gibberish. I think as a result of that, the gibberish would be more effective if the context was really dialed in on the audio side. And for that, I think about "The Sims", because of the way that they use their language "Simlish". They tend to really dress it up in the context of what it's trying to do. For a radio show, you have all the bells and whistles that go along with that - the tone of the voices, the rhythm of the performances ... maybe there's music and little sound effects and things that all kind of contribute to that. It also depends on what your goals are with the gibberish. Are you trying to get people to actually understand the content, or is it more just about an impression and creating a feeling? I think all of those things really matter as far as how you would go about something. So I would keep those in mind. It's hard to give specific suggestions without knowing what the ultimate goals were, but that's what I would say.