Reigns: A Thousand Years of Deterministic Babble

I gave a talk at ENJMIN 2020 about the gibberish-based procedural voiceover system we built for the 2016 game Reigns.

Transcript:

My talk is called "Reigns: A Thousand Years of Deterministic Babble". This talk is about a voiceover system that we created. We're going to talk about what worked, what didn't work, and some of the things that I might have done differently.

So what is Reigns exactly? Reigns is a game where you rule a kingdom. Your advisors and those within your principality come to you making suggestions and things. And you have this interface that's almost like a dating game such as Tinder, where you swipe left or right to make a choice. All of your choices affect the state of things. And that's generally how it works. There's lots of different characters, different personalities.

This was one of my first opportunities to experiment with spoken dialogue in a game. I was the audio director on this project, and in the past I've worked on games where I've done all of the content creation. But this was the first time where I wanted to take a step back as far as content creation goes, and focus more on the overall vision for the game's audio. Given that, I knew that I wanted to have the characters speak in an unfamiliar language - something that would add character to the experience. I was definitely thinking about games like "Animal Crossing" and "The Sims", and I wanted to come up with a unique and systemic approach to see if we could give it a little bit of a sense of intelligence or immersion. And in thinking about this type of approach to voiceover, and how we could maybe push it a little bit further, my mind naturally went towards popular examples of creative languages, especially from television, movies and literature. I thought about Klingon from Star Trek, Elvish from The Lord of the Rings, or Dothraki from Game of Thrones.

These types of examples are called conlangs, short for constructed languages. These languages are fully functional and quite impressive. But that type of approach seemed like overkill for a fun and light game on a short timeline - a game that was developed in under a year. Building a conlang is hard work and I'm not exactly a linguist, so I wanted to do something a little bit more streamlined, like making a bunch of random sounds with your mouth. I mean, that's pretty easy, right? That seemed like a good place to start.

Of course, if you have something that has absolutely no structure at all, it's just noise. And so we tried to come up with some ways to remedy that.

I think it works pretty well. I mean, I would say that, you know, this sort of gibberish lacks the depth that you would get from a conlang, but it still manages to personify the game and the characters. What we really set out to do is take something that's kind of chaotic, where we're putting in all these different inputs, and find a way to have some sort of controlled chaos ... something that has a little bit of structure to it, to just make it work a little bit better.

On our project, we had a three person audio team. I was audio director and we had a dedicated sound designer and composer. We all worked together in developing the system. I sort of led the charge, but we all pitched in as far as recording voices and things like that.

Recording Process

Let's talk a little bit about the recording process and our approach for how we structured the voiceover. We designed about 20 unique voices for the game and the voice actors were ourselves and some of our friends. It was fun, and really inspiring how quickly the voices would come together. We settled on a novel approach where we would start with a seed phrase, something like this one: "money banana stand". And we'd ask the voice actor to riff on this phrase and to embody a persona of one of the characters in the game. In this case, we linked this phrase with this character, "The Jester". We'd ask the actor to riff in a stream of consciousness way, keeping in mind the sounds associated with this phrase. It was really fun to do this. Having a seed phrase like this, I found that it made the performances more focused, more intentional. The gibberish felt like a unit, almost sounding like a language of sorts.

Here's another example: "quantization prerogative". Don't ask me what that means, I have no idea. That's just a seed phrase we came up with for this character, who's a nefarious magician in the dark arts. That one's actually me, and the other one was the composer. And so we would take these performances, the ones that we liked, and we'd chop them up into sets of anywhere from 30 to 90 assets in some cases. We would delete the fragments that stood out in a bad way or were redundant - sometimes, you'd have repeats of certain things. The assets themselves were usually one or two syllables, sometimes three or four. We found that having a blend of those really helped the system to work as well as it does. When it's all single syllable sounds, connecting them together becomes a lot more difficult in the sense that you lose some of the human element. It starts to sound a little more robotic. It starts to sound a little bit more like "Animal Crossing", which is not bad. It's actually really cool. It's just a different style. And when you have multisyllabic fragments, and with the way that the human voice connects sounds together, you hear a bit more of an emotional element in the speech.

These were things we kept in mind. Also, cutting out hard syllables seemed to work really well. Something like "Brero" or "Di", having those hard transients, really helped with connecting fragments together that may have been pulled from different sections of the recording. Even vowels could be used as hard syllables in some cases, depending on the performance.

Implementation Process

Ok, so let's talk about the implementation process now. We have these recordings of different voices, and we've chopped them up into these little fragments. Now we have to figure out how to actually put them together in a way that sounds reasonable.

Let's take this character as an example, "Lady Gray". Here's a card with some text from the game. Thinking about how to implement this, we started by considering how we could differentiate the characters. What could we do, and what would some of the parameters be that would help us to do that? And so this is what we came up with. It's a short list of parameters, but they all give us a certain amount of variability and control. So it's simple, but enough levers to try to separate the characters from each other. So we have voice type, for which set of recorded fragments we're going to use for this character. Pitch, for whether we want to make this character's voice a little slower and deeper, or maybe a little higher and thinner. This was just a nice element to have after recording the voices. In some cases we wanted to tweak them a bit. We had about 20 voice sets that we created, but there were more characters than that. So we ended up using a single voice set for multiple characters in some cases. And so pitch was a nice way to differentiate those usages.

Resonant frequency was a parametric EQ band, essentially picking a frequency to boost or drop, in order to adjust the timbre of a voice. And then fragment overlap size, which is basically about figuring out how much distance to put between our voice fragments. Sometimes we'd use a negative distance to get the fragments closer together. As far as a global parameter goes, we had to figure out, "OK, how long do we want these performances to be?" So we had to come up with a text to speech ratio for duration based on how much text there is. How much speech should there be and how long should it be? It was important to maintain a certain flow for the gameplay. People could be swiping through cards relatively quickly. And so we didn't want the audio experience to be getting cut off constantly by the users' natural way of playing the game. The speech should never get too long. It should never be more than a couple of seconds. But if this text is short, then the speech should be shorter to reflect that. So the question was, how long should this performance be?

What's going to feel natural? And the answer was pretty obvious. It's n/55. OK, maybe this isn't that obvious, but this is what worked. And what we're talking about here is the length in time of the speech. So length = n/55. So what's n? N is the character count. So it's the number of characters in this card. And 55, what is 55? It's nothing. It's an arbitrary number ... trial and error is what yielded this formula, where length equals character count divided by fifty-five. And so for this example, the character count is eighty-seven. There are eighty-seven characters in the text of this card, and the formula works out to about 1.58 seconds. Given the card text limit, the duration will never get too long. So this works out well.
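
To make this concrete, here's a minimal sketch of the per-character parameters and the duration rule in code. All of the names here (VoiceConfig, speech_length_seconds, and so on) are hypothetical illustrations, not the actual Reigns implementation.

```python
from dataclasses import dataclass

# Hypothetical per-character settings, mirroring the parameters described above.
@dataclass
class VoiceConfig:
    voice_set: str             # which set of recorded fragments to use, e.g. "jester"
    pitch_shift: float         # semitones up or down from the original recording
    eq_frequency_hz: float     # center of the resonant (parametric EQ) band
    eq_gain_db: float          # boost or cut applied at that frequency
    fragment_overlap_s: float  # spacing between fragments; negative pulls them closer

def speech_length_seconds(card_text: str) -> float:
    """Target speech duration: character count divided by 55 (found by trial and error)."""
    return len(card_text) / 55.0

# An 87-character card works out to roughly 1.58 seconds of speech.
print(round(speech_length_seconds("x" * 87), 2))  # 1.58
```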

Now that we have our formula, the way the system works is we take the card text and use it as a seed. We use the seed to deterministically generate random values. And then we use those random values to select which voice fragments to play. And we'll do this as many times as it takes to reach the desired speech length, which in this case was 1.58 seconds.

In this example, we have three fragments, "anats", "bnanda" and "UsTAH!". This is just a made up example. But between those and the fragment overlap amount for this character, which is a negative overlap that brings them closer together, in total that puts us past the desired length. This boundary that we're using is actually a soft boundary.

Going over a bit on time is OK because we're in the ballpark of what we want. And we'll always play at least one fragment, which is important because if it's a really short piece of text, you don't want it to play nothing at all. Because it's seeded by the text, the cards always trigger the same speech fragments every time, which is neat. It makes playback reproducible and it makes it easier to test. One of the hypotheses was "maybe this feels a little less random?". It's hard to say for sure, but that was the intent. As far as making the performances feel more natural, having the overlapping fragments really allowed us to dial in speaking styles. Maybe for certain sets of voices, the speech is slower and we want to adjust the overlap accordingly to match the personality of the character.

The last thing that we did is to always put the longest fragment that we chose at the end. We found that this just sounded better. It sounded a little bit more natural. And I think a lot of that has to do with the way that we chopped up these performances. A single syllable that gets chopped up tends to be at the beginning or in the middle of a phrase. If it's a multisyllabic fragment, and therefore a longer length, it tends to have been taken from the end of a spoken phrase. And so those just naturally seem to sound better at the end of these stitched together performances. That's basically how the system works.
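
Pulling it all together, a rough sketch of the selection logic might look something like this. The fragment names, durations, and helper names are made up for illustration; the real system operated on audio assets rather than strings.

```python
import random

def pick_fragments(card_text: str, fragments: list[tuple[str, float]],
                   overlap_s: float) -> list[str]:
    """Deterministically stitch gibberish fragments for a card.

    fragments: (name, duration in seconds) pairs from this character's voice set.
    overlap_s: spacing between fragments; negative values pull them closer together.
    """
    target = len(card_text) / 55.0      # desired speech length in seconds
    rng = random.Random(card_text)      # seeded by the card text: same card, same speech

    chosen: list[tuple[str, float]] = []
    total = 0.0
    # Always play at least one fragment; the target is a soft boundary,
    # so going a little over is fine.
    while not chosen or total < target:
        name, duration = rng.choice(fragments)
        total += duration if not chosen else duration + overlap_s
        chosen.append((name, duration))

    # Put the longest fragment last; fragments cut from the ends of spoken
    # phrases tend to close the performance more naturally.
    longest = max(chosen, key=lambda f: f[1])
    chosen.remove(longest)
    chosen.append(longest)
    return [name for name, _ in chosen]

# Example with made-up fragments and a slight negative overlap:
voice = [("anats", 0.6), ("bnanda", 0.7), ("UsTAH!", 0.5)]
card = "The crops have failed again, my liege. Shall we open the royal granaries to the people?"
print(pick_fragments(card, voice, -0.1))
```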

Postmortem

I want to talk a little bit about what went right and what went wrong with this system and some future ideas about how to improve on it. I think overall, we all felt pretty good about it. Stitching variable size fragments together works well.

As far as some categories that we can talk about, we can start with language. I think we could have done a bit more to tie the gibberish together. What if everyone spoke the same strain of gibberish? That might have felt more intelligent as opposed to having all these different strains. Although, to be fair that does differentiate the characters a bit. I think that might have been sort of a trade off there, but definitely something worth thinking about.

And then there's performance. I think we could have hired more actors, we could have recorded more voices, of course, and maybe taken it a bit more seriously. We were having a lot of fun with it, but I think if we had honed in a little bit more on the direction, we might have been able to improve the results somewhat. We just went about it in a loose way. I would record a couple of voices, the sound designer would record a couple of voices on his end ... there wasn't a super unified type of approach.

The thing that really drove the system, the deterministic method that we came up with, I think we felt was interesting. It's good for testing because it's reproducible, but honestly, it's not that noticeable for people. And I think because of that, it kind of undercuts the design, because the intention was for it to really lend this sense of embodiment and intelligence to the speech. It doesn't quite hit that mark, I don't think.

Overall, I would say it's successful as a proof of concept. But between all three categories, I think we could have done a bit more to create a sense of immersion and intelligence with the system. I think if I could do it again, I would try to find a more effective deterministic method, something that would add another layer as far as differentiating the characters, making it feel more like a shared culture between everyone, since they're all living in and around your kingdom.

Syllable Based Seeding

Perhaps a system that seeds based on syllables instead of paragraph length would have given us a better way to do these sorts of things. I actually built such a system for another project, and although it went unused, it worked really well. The idea was that you would map syllables of text directly to the syllable recordings, using a similar deterministic method. What that would do is give the invented language a more coherent sound, and your experience of it would feel more intelligent. If you had a repeated segment of speech like "ho ho" or "no, no, no!", the system would reflect that with repetition. That made it feel even more like a real language. Or you might pick up on themes of conversation: if multiple characters are talking about some particular theme and they keep using certain words, you might actually be able to pick up on those things. In a subtle way over time, you might start to internalize a sense of this language, even though it's completely fabricated and doesn't really have any function.
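
A minimal sketch of that idea, assuming a crude syllable splitter (real syllabification is more involved) and a hypothetical bank of recorded gibberish syllables: each text syllable hashes to one recording, so repeated words repeat sounds.

```python
import hashlib

def crude_syllables(word: str) -> list[str]:
    """Very rough syllable split: break after each vowel group. Illustration only."""
    vowels = "aeiouy"
    parts: list[str] = []
    current = ""
    for i, ch in enumerate(word):
        current += ch
        if ch in vowels and (i + 1 == len(word) or word[i + 1] not in vowels):
            parts.append(current)
            current = ""
    if current:                     # trailing consonants attach to the last syllable
        parts[-1:] = [parts[-1] + current] if parts else [current]
    return parts

def gibberish_for_text(text: str, syllable_bank: list[str]) -> list[str]:
    """Map each text syllable to one recorded gibberish syllable, deterministically,
    so repeated words produce repeated sounds."""
    out = []
    for word in text.lower().split():
        for syllable in crude_syllables(word):
            digest = hashlib.sha256(syllable.encode()).digest()
            index = int.from_bytes(digest[:4], "big") % len(syllable_bank)
            out.append(syllable_bank[index])
    return out

bank = ["ba", "sto", "ki", "ren", "ulo", "da", "mph", "tee"]
print(gibberish_for_text("no no no", bank))   # the same fragment, three times
print(gibberish_for_text("ho ho", bank))      # a different fragment, twice
```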

Machine Learning

There have been amazing advances in speech synthesis. Just recently there was a paper where your voice can be almost completely replicated with only five seconds of your recorded speech, which is scary and amazing. I think something along these lines could lend itself to a really interesting game implementation for procedurally generating speech. And I'm really excited to see something like that happen in the future. I think we've barely scratched the surface of what's possible for inventive language in games. It's a really interesting space.

Most of the gibberish voiceover that I've seen to this point in games has been largely based on aesthetics. And I think it would be interesting to see more investigation into ways of making these inventive languages have a sense of function, even if they don't really have any function. I'm excited to see what we come up with.

Q: Did chasing the fun help?

I think so. I think the fun is what really drove the process, it was the idea of doing something new personally for myself, something I'd never done before, but also seeing how we can take some of the touchstones for this kind of thing, like "The Sims" and "Animal Crossing" and do something different. Certainly, it can be a technical rabbit hole. You can get really deep with it. In some of my other experimentations I've found it can be really challenging to get good results. For instance, systems that are based on single syllables. That is a really hard thing to do, to combine single syllable sounds into speech that sounds pleasing and doesn't just sound like a robot. It's really hard without advanced technology. I think ultimately it was more of an artful process for me than an intellectual one. And in that manner, at times it was about picking the path of least resistance, picking the thing that was going to give us the most bang for our buck and not get us stuck in a technological trap.

Q: Did you consider applying these concepts to Hyper Light Drifter?

No, I don't think we thought about that. I'm sure it could have worked, but there was something nice about just having those glyphs, and it was also really challenging, I think, from a design perspective. Occasionally you'd also have these storyboards that were meant to convey information too ... that to me felt like a different thing. It's possible that we could have done something with sound. But I think it accomplished what it was trying to do without it.

Q: Can you talk more about your role as Audio Director on Reigns?

I was the first audio person brought onto the project, and so in that way it kind of fell upon me to make some suggestions about how to move forward. I think originally the expectation and the intention was that I would do the music. But as I got into it, I realized that I was a little bit more interested in the systems as opposed to creating content, and that I would open it up to some other people to get involved. I would just focus on supporting them and focusing on the systems. And so that's really where I started. I was interested in this voiceover system, and I was also interested in a music system that we built for this game, which is also a type of phrase based system. It used four part voice leading ... soprano, alto, tenor, bass. And those four parts were mapped to the four categories that you're trying to manage in the game: religion, the military, the people and the treasury. So each one of those voices is mapped to those in some fairly straightforward ways, like driving part volume. But the system also had other challenges, such as figuring out how to transition through the many different phases of the game.
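
As a rough sketch of that mapping (the meter names and the 0-100 range are assumptions for illustration, not the game's actual code), each kingdom meter drives the volume of one vocal part:

```python
# Hypothetical mapping: each kingdom meter (assumed 0-100) drives one vocal part's volume.
PART_FOR_METER = {
    "religion": "soprano",
    "military": "alto",
    "people":   "tenor",
    "treasury": "bass",
}

def part_volumes(meters: dict[str, float]) -> dict[str, float]:
    """Convert meter values into 0.0-1.0 volumes for their vocal parts."""
    return {PART_FOR_METER[name]: max(0.0, min(1.0, value / 100.0))
            for name, value in meters.items()}

print(part_volumes({"religion": 80, "military": 35, "people": 50, "treasury": 10}))
```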

Q: Does localized text generate different voiceover?

I'm actually not sure! That's something we should look into. It'd be interesting to see if it's different. That would be appropriate, I'd say.

Q: What did you learn as audio director?

It was a really good experience for me to be in that role, because it helped me to learn that I really don't like managing people at all. But I also learned that I really like getting involved with building systems. I find it really interesting, and I've already done a lot of systems work since then. I've been working with Dinosaur Polo Club out of New Zealand for the last six-odd years, working on Mini Metro and Mini Motorways, and have done lots of interesting sound work for those games. On Solar Ash, I've been doing a lot of systems work as well. I've always had a bit of an itch for technical problem solving and coming up with novel approaches to things. Between that and learning I dislike working in a managerial role, it was a great learning experience. Actually at Heart Machine, that was a role that was mine in theory, if I wanted it. But we ended up bringing in somebody else to take on that role, because I didn't want to do it.

Q: How would you recommend applying this technique to an in-game radio channel?

It's hard to say. A radio program is a totally different form factor. When you don't have the context that you get from visuals, then you're really asking a lot from gibberish. I think as a result of that, the gibberish would be more effective if the context was really dialed in on the audio side. And for that, I think about "The Sims", because of the way that they use their language "Simlish". They tend to really dress it up in the context of what it's trying to do. For a radio show, you have all the bells and whistles that go along with that - the tone of the voices, the rhythm of the performances ... maybe there's music and little sound effects and things that all kind of contribute to that. It also depends on what your goals are with the gibberish. Are you trying to get people to actually understand the content, or is it more just about an impression and creating a feeling? I think all of those things really matter as far as how you would go about something. So I would keep those in mind. It's hard to give specific suggestions without knowing what the ultimate goals were, but that's what I would say.

Presentation: Abracadata!

I gave a microtalk at GDC 2018 as part of a session at the Artificial Intelligence Summit called 'Turing Tantrums: Devs Rant!'. I shared a thought experiment about exploring the possibility space of abstracted data relationships that cross disciplinary boundaries. Unlikely data marriages!

Transcript:

As a bit of an outsider I thought instead of ranting, it might be better to share a thought experiment around an area of interest for me lately ...

AbracaDATA!

Games are a treasure trove of data.

A lot of what happens in games has something to do with numbers and math, and this stuff is great for creating and reinforcing the internal relationships in a game.

The most common relationship is player input, and how it drives just about everything. But let’s focus elsewhere.

Maybe your game has an enemy. A blue rectangle OH NO! It’s being shuffled around the world with some movement instructions. And you spruce it up with an animation, and maybe it feels good with some tweaking, but if physics and animation share their data, maybe they make even better decisions. Also that bearded man is now a giant dog thing.

Of course, you may not want all your systems to share, and sometimes a hand-authored, isolated thing might be what you need. But tying camera movements to explosions or using text to drive a gibberish voiceover are examples of ways that tentacular relationships can improve the way a game feels.

Speaking of gibberish, I spend a lot of time thinking about sound, and lately I’ve been thinking about data sonification: using a data input to generate a sonic output. When a gameplay event occurs, like a footstep, we like to trigger a sound. This is a very useful and simple form of data sonification.

Another common practice is to map an object’s spatial position visually, to a relative spatial position, sonically. For instance, mapping the X position on screen to the stereo pan position of a sound, or mapping an object’s distance from the camera to a sound’s volume and brightness.
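
A simple sketch of those two mappings, with illustrative numbers rather than anything from a particular engine:

```python
def pan_from_screen_x(x: float, screen_width: float) -> float:
    """Map a horizontal screen position to a stereo pan value in [-1, 1]."""
    return (x / screen_width) * 2.0 - 1.0

def volume_and_cutoff_from_distance(distance: float, max_distance: float) -> tuple[float, float]:
    """Closer objects are louder and brighter: volume in [0, 1], and a lowpass
    cutoff in Hz as a stand-in for 'brightness'."""
    closeness = max(0.0, 1.0 - distance / max_distance)
    volume = closeness
    cutoff_hz = 500.0 + closeness * 19500.0   # 500 Hz far away, ~20 kHz up close
    return volume, cutoff_hz

print(pan_from_screen_x(1600, 1920))            # mostly panned right
print(volume_and_cutoff_from_distance(10, 50))  # fairly close: loud and bright
```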

These mimic the way we hear things in the real world and are simple victories. But the examples I’ve given so far are well known and commonly employed. They’re perfect for clarifying, giving the player more coordinated feedback about what they’re interacting with.

But I want to talk about some of the less utilitarian places these relationships can go. Why not do more to re-contextualize this data instead? We could springboard ourselves into explorations of relationships that are weird, novel, counterintuitive and wonderfully asymmetric.

Here’s a silly one. There’s a game called ‘Sonic Dreams Collection’, where changing the size of the game window on the main menu changes the pitch and speed of the music. But what if it went beyond that? Suddenly you might care about window sizes in this strange new context, and it might elicit a reaction normally not reserved for the size of your window…

Or what if you tied the movement pattern of the ripples of a nearby river to the hair physics of your player, but only inject the data as you move away from it? What is this environment trying to evoke? Negative magnetism? (what does that even mean)

Finding meaning here can be a bit like trying to parse through a tarot card reading. You draw some random sources and try to map meaning onto their relationship.

Maybe the gag about window size didn’t inspire a deeper search for meaning, but what about a more opaque & esoteric data abstraction? You might experience it as a kind of intelligence. And we could employ these relationships in subtle but cumulative ways.

Bees, for instance, perform a figure-8 called the ‘waggle dance’ that relays important locational info to other bees. You could create a dumb version of this cooperative relationship using abstracted data, and employ it within a system of similar objects. Maybe the relationship relies on distance between the objects, so that when they get close, they appear to share information with each other through sound or movement.

As worldbuilders, we could hint at a deeper ecology, through layers of data abstraction that might seem cooperative, adversarial, emergent, or mysterious and difficult to verbalize. We can suggest that with or without the player, the actors in this ecosystem are hopelessly entangled, and will carry on with their ebb and flow, just like we all do. Could be cool ...

The nice thing is that unlike the "waggle dance", you don’t need to prove it out with science. Maybe even the most arbitrary data relationship could feel like real intelligence if it’s been sufficiently abstracted. Players will conjure up their own interpretations, they like to do that. So you just need to convince them that they are experiencing something meaningful.

In other words, I think that by making different parts of a game communicate and share information in non-traditional ways, we can emulate the vitality we experience from real intelligence, and as a result it may be possible to manufacture a deeper sense of meaning and causality.

And the more liberal the different parts are in communicating with unlikely partners, the more things may start to get downright ecological.

And an interesting ecology of data relationships would probably have different kinds at play ... opaque ones, transparent ones, those that seem arbitrary, those that are rational, the esoteric (opaque+arbitrary), the absurd (transparent+arbitrary), the accessible (transparent+rational), the intelligent? (opaque+rational) (i totally made this up)

I think the abstraction and recontextualization of data can lead to all sorts of results. But if we sense there are meaningful relationships of cause and effect at play, that could lead us to suppose there is intelligence, and that could bring more depth to our experience.

So give it a shot! Things will definitely happen if you let your systems co-mingle.

You could let the volume level of a creature’s mating call drive the probability that other creatures respond in kind.

You could light a room using the average color of the last 60 frames.

You could take the wave propagation system used to drive visual wind FX and map it to the size of an NPC’s shoes.
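
For example, the second suggestion might be sketched as a rolling average like this (the frame colors and the light they feed are stand-ins for whatever your renderer and lighting system expose):

```python
from collections import deque

class RollingFrameColor:
    """Average color of the last 60 frames, to drive a room light.
    Colors are (r, g, b) tuples in [0, 1]; the per-frame average color would
    come from wherever your renderer can cheaply provide it."""

    def __init__(self, window: int = 60):
        self.colors = deque(maxlen=window)

    def push(self, frame_color):
        """Add this frame's average color and return the rolling average."""
        self.colors.append(frame_color)
        n = len(self.colors)
        return tuple(sum(c[i] for c in self.colors) / n for i in range(3))

light = RollingFrameColor()
for t in range(120):
    tint = light.push((0.8, 0.2 + t * 0.005, 0.1))  # pretend these come from rendered frames
print(tint)  # feed this into the room's light color
```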

But in any case,

AbracaDATA!

Or perhaps … Abraca...dada?(ism)

Link: GDC Vault: 'Turing Tantrums! AI Devs Rant'

Presentation: Serialism & Sonification in Mini Metro

I gave a talk at GDC 2018 as part of a session at the Artificial Intelligence Summit called 'Beyond Procedural Horizons'. I talked about how we combined data sonification and concepts taken from the musical approach known as Serialism to build a soundscape for Mini Metro.

Transcript:

Serialism & Sonification in Mini Metro

or, how we avoided using looping music tracks completely, by using sequential sets of data to generate music and sound.

Serialism

In music, there is a technique called Serialism that uses sequential sets of data (known as series), set about on different axes of sound (pitch, volume, rhythm, note duration, etc), working together to create music.

In Mini Metro, we apply this concept by using internal data from the game and externally authored data in tandem to generate the music.

You might have noticed that the game has a clock - the game is broken up into time increments that are represented as hours, days and weeks (though of course much faster than real time). And before diving in, it’s important to know that we derive our music tempo from the duration of...

1 In-Game Hour = 0.8 secs = 1 beat @ 72 bpm = our master pulse.

We use this as our standard unit of measurement for when to trigger sounds. In other words, most of the sounds in the game are triggered periodically, using fractional durations of 0.8 seconds.

In Mini Metro, the primary mode of authorship lies in drawing and modifying metro lines. They’re also the means by which everything is connected. They serve as the foundation upon which the soundscape of pitches and rhythms is designed. Each line is represented by a unique stream of music generated using data from different sources.

The simplest way to describe this system is that each metro line is represented by a musical sequence of pulses, triggered at a constant rate with a constant pitch. This rate and pitch are constant until they are shifted by a change in gameplay. Each metro station represents one pulse in that sequence. Each pulse has some unique properties, such as volume, timbre, and panning, and these are calculated using game data. Still other properties are inherited from lower levels of abstraction, namely unique loadouts for each game level. Some levels tell the pulses to fade in gradually, other levels might tell the pulses to trigger using a swing groove instead of a constant rate, and the levels are differentiated in other ways as well.

[1, 2, 3, 4, 6]

All of this musical generation is done using sets of data. And referring back to Serialism, the data is quite often sequential, in series. These numbers actually represent multiples of time fragments. Or, to put it more simply, rhythms. In the case of rhythms and pitches, the data is authored, so we have more control over what kind of mood we’d like to evoke. This data is cycled through in different ways during gameplay to generate music.
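
A highly simplified sketch of that idea: an authored series of multiples is cycled over fractions of the 0.8-second master pulse to produce trigger times, and each station on a line contributes one pulse whose properties come from game data. Everything here beyond the series and the master pulse (the fraction, the passenger and position fields) is illustrative, not Mini Metro's actual code.

```python
from itertools import cycle, islice

MASTER_PULSE_S = 0.8   # one in-game hour: the master pulse described above

def trigger_times(series, count, fraction=0.25):
    """Cycle an authored series of multiples (e.g. [1, 2, 3, 4, 6]) over a
    fraction of the master pulse to produce a stream of trigger times."""
    t, times = 0.0, []
    for multiple in islice(cycle(series), count):
        times.append(round(t, 3))
        t += multiple * MASTER_PULSE_S * fraction
    return times

def line_pulses(stations, line_pitch):
    """One pulse per station on a metro line; per-pulse properties here are
    derived from made-up game data (waiting passengers, screen position)."""
    return [{
        "pitch": line_pitch,
        "volume": min(1.0, 0.3 + 0.1 * s["waiting_passengers"]),
        "pan": s["x_normalized"] * 2.0 - 1.0,
    } for s in stations]

rhythm_series = [1, 2, 3, 4, 6]       # authored data: multiples of the time fragment
print(trigger_times(rhythm_series, 8))
print(line_pulses([{"waiting_passengers": 3, "x_normalized": 0.2},
                   {"waiting_passengers": 0, "x_normalized": 0.7}], line_pitch="C4"))
```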

So we’ve got some authored data generating musical sequences, but what about using game data? Ideally we could sonify it to give the player some useful feedback about what is happening.

Combinations of game and authored data are found throughout Mini Metro’s audio system. There’s lots of game data and authored data being fed into the system, working in tandem. Authored data is often used to steer things in a musical direction, while game data is used to more closely marry things to gameplay. In some cases, authored data even made its way into other areas of the game. Certain game behaviors like the passenger spawning were retrofitted to fire using rhythm assignments specified by the sound system.

You might ask why go through the trouble of doing things this way? Well, it is really fun. But beyond that there are a variety of answers, and I could go into a lot of depth about it, but I think the most important reasons are:

Immediacy & Embodiment

Immediate feedback is often reserved for sound effects and not music, and immediate feedback in music can feel forced if not handled correctly. This type of system allows us to bring this idea of immediacy into the music in a way that feels natural.

The granularity of the system allows the soundscape to respond to the game state instantaneously and evolve with the metro system as it grows, and a holistic system handles all of the gradation for you. When your metro is smaller and less busy, the sound is smaller and less busy. As your metro grows and gets more complex, so does the music and sound to reflect that. When you accelerate the simulation, the music and sound of your metro accelerates. When something new is introduced into your system, you’re not only notified up front, but also regularly over time as its sonic representation becomes a part of the ambient tapestry.

Embodiment

And this all (hopefully) ties into a sense of embodiment. Because all of these game objects have sounds that trigger in a musical way, and all use a shared rhythmic language that is cognizant of the game clock, and use game data to further tie them to what is actually happening in the game, things start to feel communal and unified.

It’s an ideal more than a guarantee, but if executed well, I think you can start to approach something akin to a holistic experience for the player.

Thanks!

Link: GDC Vault: 'Beyond Procedural Horizons'