Mini Metro: Building Your Own City Soundscape

Dino Polo Club recently released the Workshop update for Mini Metro, which allows users to create their own maps for the game. To further help with this I've created a tutorial video and a toolkit. The toolkit contains a written guide, as well as a JSON schema set up for Visual Studio Code by default. By using this schema in tandem with Visual Studio Code, you have access to error linting, autocomplete, and property descriptions.

Link: Mini Metro Audio Toolkit (GitHub)

Reigns: A Thousand Years of Deterministic Babble

I gave a talk at ENJMIN 2020 about the gibberish-based procedural voiceover system we built for the 2016 game Reigns.

Transcript:

My talk is called "Reigns: A Thousand Years of Deterministic Babble". This talk is about a voiceover system that we created. We're going to talk about what worked, what didn't work, and some of the things that I might have done differently.

So what is Reigns exactly? Reigns is a game where you rule a kingdom. Your advisors and those within your principality come to you making suggestions and things. And you have this interface that's almost like a dating game such as Tinder, where you swipe left or right to make a choice. All of your choices affect the state of things. And that's generally how it works. There's lots of different characters, different personalities.

This was one of my first opportunities to experiment with spoken dialogue in a game. I was the audio director on this project, and in the past I've worked on games where I've done all of the content creation. But this was the first time where I wanted to take a step back as far as content creation goes, and focus more on the overall vision for the game's audio. Given that, I knew that I wanted to have the characters speak in an unfamiliar language - something that would add character to the experience. I was definitely thinking about games like "Animal Crossing" and "The Sims", and I wanted to come up with a unique and systemic approach to see if we could give it a little bit of a sense of intelligence or immersion. And in thinking about this type of approach to voiceover, and how we could maybe push it a little bit further, my mind naturally went towards popular examples of creative languages, especially from television, movies and literature. I thought about Klingon from Star Trek, Elvish from The Lord of the Rings, or Dothraki from Game of Thrones.

These types of examples are called conlangs, for "constructed language". These languages are fully functional and quite impressive. But that type of approach seemed like overkill for a fun and light game on a short timeline - a game that was developed in under a year. Building a conlang is hard work and I'm not exactly a linguist, so I wanted to do something a little bit more streamlined, like making a bunch of random sounds with your mouth. I mean, that's pretty easy, right? That seemed like a good place to start.

Of course, if you have something that has absolutely no structure at all, it's just noise. And so we tried to come up with some ways to remedy that.

I think it works pretty well. I mean, I would say that, you know, this sort of gibberish, it lacks the depth that you would get from a conlang, but it still manages to personify the game and the characters. What we really tried to set out to do, is to take something that's kind of chaotic, where we're putting in all these different inputs and find a way to have some sort of a controlled chaos ... something that has a little bit of structure to it, to just make it work a little bit better.

On our project, we had a three person audio team. I was audio director and we had a dedicated sound designer and composer. We all worked together in developing the system. I sort of led the charge, but we all pitched in as far as recording voices and things like that.

Recording Process

Let's talk a little bit about the recording process and our approach for how we structured the voiceover. We designed about 20 unique voices for the game and the voice actors were ourselves and some of our friends. It was fun and really inspiring how quickly the voices would come together. We settled on a novel approach where we would start with a seed phrase, something like this one: "money banana stand". And we'd ask the voice actor to riff on this phrase and to embody a persona of one of the characters in the game. In this case, we linked this phrase with this character, "The Jester". We'd ask the actor to riff in a stream of consciousness way, keeping in mind the sounds associated with this phrase. It was really fun to do this. Having a seed phrase like this, I found that it made the performances more focused, more intentional. The gibberish felt like a unit, almost sounding like a language of sorts.

Here's another example: "quantization prerogative". Don't ask me what that means, I have no idea. That's just a seed phrase we came up with for this character, who's a nefarious magician in the dark arts. That one's actually me, and the other one was the composer. And so we would take these performances, the ones that we liked, and we'd chop them up into sets of anywhere from 30 to 90 assets in some cases. We would delete the fragments that stood out in a bad way or were redundant - sometimes you'd have repeats of certain things. The assets themselves were usually one or two syllables, sometimes three or four. We found that having a blend of those really helped the system to work as well as it does. When it's all single syllable sounds, connecting them together becomes a lot more difficult in the sense that you lose some of the human element. It starts to sound a little more robotic. It starts to sound a little bit more like "Animal Crossing", which is not bad. It's actually really cool. It's just a different style. And when you have multisyllabic fragments, and with the way that the human voice connects sounds together, you hear a bit more of an emotional element in the speech.

These were things we kept in mind. Also, cutting out hard syllables seemed to work really well. Something like "Brero" or "Di" - having those hard transients really helped with connecting fragments together that may have been pulled from different sections of the recording. Even vowels could be used as hard syllables in some cases, depending on the performance.

Implementation Process

Ok, so let's talk about the implementation process now. We have these recordings of different voices, and we've chopped them up into these little fragments. Now we have to figure out how to actually put them together in a way that sounds reasonable.

Let's take this character as an example, "Lady Gray". Here's a card with some text from the game. Thinking about how to implement this, we started thinking about how we can differentiate the characters. What can we do and what will some of the parameters be that will help us to do that? And so this is what we came up with, it's a short list of parameters, but they all give us a certain amount of variability and control. So it's simple, but enough levers to try to separate the characters from each other. So we have voice type, for which set of recorded fragments we're going to use for this character. Pitch, for whether we want to make this character's voice a little slower and deeper, or maybe a little higher and thinner. This was just a nice element to have after recording the voices. In some cases we wanted to tweak them a bit. We had about 20 voice sets that we created, but there were more characters than that. So we ended up using a single voice set for multiple characters in some cases. And so pitch was a nice way to differentiate those usages.

Resonant frequency was a parametric EQ band, essentially picking a frequency to boost or drop, in order to adjust the timbre of a voice. And then fragment overlap size, which is basically about figuring out how much distance to put between our voice fragments. Sometimes we'd use a negative distance to get the fragments closer together. As far as a global parameter goes, we had to figure out, "OK, how long do we want these performances to be?" So we had to come up with a text to speech ratio for duration based on how much text there is. How much speech should there be and how long should it be? It was important to maintain a certain flow for the gameplay. People could be swiping through cards relatively quickly. And so we didn't want the audio experience to be getting cut off constantly by the users' natural way of playing the game. The speech should never get too long. It should never be more than a couple of seconds. But if this text is short, then the speech should be shorter to reflect that. So the question was, how long should this performance be?

What's going to feel natural? And the answer was pretty obvious. It's n/55. OK, maybe this isn't that obvious, but this is what worked. And what we're talking about here is length in time of the speech. So length = n/55. So what's n? N is the character count. So it's the number of characters in this card. And 55, what is 55? It's nothing. It's an arbitrary number ... trial and error is what yielded this formula, where length equals character count divided by fifty-five. And so for this example, the character count is eighty-seven. There's eighty-seven characters in the text of this card, and this formula works out to about 1.58 seconds. Given the card text limit, the duration will never get too long. So this works out well.
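As a sketch in Python (the constant 55 is the trial-and-error value mentioned above; the function name is mine):

```python
def speech_length_seconds(card_text: str) -> float:
    """Target speech duration: character count divided by 55."""
    return len(card_text) / 55.0

# An 87-character card works out to roughly 1.58 seconds.
length = speech_length_seconds("x" * 87)
print(round(length, 2))  # 1.58
```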

Now that we have our formula, the way the system works is we take the card text and use it as a seed. We use the seed to deterministically generate random values. And then we use those random values to select which voice fragments to play. And we'll do this as many times as it takes to reach the desired speech length, which in this case was 1.58 seconds.

In this example, we have three fragments, "anats", "bnanda" and "UsTAH!". This is just a made up example. But between those and the fragment overlap amount for this character, which is a negative overlap that brings them closer together, in total that puts us past the desired length. This boundary that we're using is actually a soft boundary.

Going over a bit on time is OK because we're in the ballpark of what we want. And we'll always play at least one fragment, which is important because if it's a really short piece of text, you don't want it to play nothing at all. Because it's seeded by the text, the cards always trigger the same speech fragments every time, which is neat. It makes playback reproducible and it makes it easier to test. One of the hypotheses was "maybe this feels a little less random?". It's hard to say for sure, but that was the intent. As far as making the performances feel more natural, having the overlapping fragments really allowed us to dial in speaking styles. Maybe for certain sets of voices, the speech is slower and we want to adjust the overlap accordingly to match the personality of the character.

The last thing that we did is to always put the longest fragment that we chose at the end. We found that this just sounded better. It sounded a little bit more natural. And I think a lot of that has to do with the way that we chopped up these performances. A single syllable that gets chopped up tends to be at the beginning or in the middle of a phrase. If it's a multisyllabic fragment and therefore a longer length, it tends to have been taken from the end of a spoken phrase. And so those just naturally seem to sound better at the end of these stitched together performances. That's basically how the system works.
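To make the whole thing concrete, here's a rough sketch of how I'd reconstruct the selection logic in Python. The fragment names, durations, and the hashing scheme are illustrative stand-ins, not our actual implementation:

```python
import hashlib
import random

def assemble_speech(card_text, fragments, overlap_s=-0.05):
    """Deterministically pick voice fragments for a card.

    fragments: list of (name, duration_seconds) tuples.
    Returns the chosen fragment names, longest last.
    """
    target = len(card_text) / 55.0  # duration formula from the talk
    # Seed the RNG from the card text, so the same card
    # always produces the same speech.
    seed = int(hashlib.sha256(card_text.encode()).hexdigest(), 16)
    rng = random.Random(seed)

    chosen, total = [], 0.0
    # Always play at least one fragment; the target is a soft
    # boundary, so going slightly over is fine.
    while not chosen or total < target:
        name, duration = rng.choice(fragments)
        chosen.append((name, duration))
        total += duration + overlap_s  # negative overlap pulls fragments together

    # The longest fragment goes last - it tends to come from the
    # end of a spoken phrase and sounds more natural as a closer.
    longest = max(chosen, key=lambda f: f[1])
    chosen.remove(longest)
    chosen.append(longest)
    return [name for name, _ in chosen]
```

Because the seed comes only from the card text, playback is reproducible across sessions, which is the property that made the system easy to test.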

Postmortem

I want to talk a little bit about what went right and what went wrong with this system and some future ideas about how to improve on it. I think overall, we all felt pretty good about it. Stitching variable size fragments together works well.

As far as some categories that we can talk about, we can start with language. I think we could have done a bit more to tie the gibberish together. What if everyone spoke the same strain of gibberish? That might have felt more intelligent as opposed to having all these different strains. Although, to be fair, that does differentiate the characters a bit. I think there might have been sort of a trade-off there, but it's definitely something worth thinking about.

And then there's performance. I think we could have hired more actors, we could have recorded more voices, of course, and maybe taken it a bit more seriously. We were having a lot of fun with it, but I think if we had honed in a little bit on the direction we might have been able to improve the results somewhat. We just went about it in a loose way. I would record a couple of voices, the sound designer would record a couple of voices on his end ... there wasn't a super unified type of approach.

The thing that really drove the system, the deterministic method that we came up with, I think we felt was interesting. It's good for testing because it's reproducible, but honestly, it's not that noticeable for people. And I think because of that, it kind of undercuts the design, because the intention was for it to really lend this sense of embodiment and intelligence to the speech. It doesn't quite hit that mark, I don't think.

Overall, I would say it's successful as a proof of concept. But between all three categories, I think we could have done a bit more to create a sense of immersion and intelligence with the system. I think if I could do it again, I would try to find a more effective deterministic method, something that would add another layer as far as differentiating the characters, making it feel more like a shared culture between everyone, since they're all living in and around your kingdom.

Syllable Based Seeding

Perhaps a system that seeds based on syllables instead of paragraph length would have given us a better way to do these sorts of things. I actually built such a system that went unused for another project, but it worked really well. The idea was that you would map syllables of text directly to the syllable recordings, using a similar deterministic method. What that would do is give the invented language a more coherent sound, and your experience of it would feel more intelligent. If you had a repeated segment of speech like, "ho ho" or "no, no, no!" the system would reflect that with repetition. That made it feel even more like a real language. Or you might pick up on themes of conversation if multiple characters are talking about some particular theme and they keep using certain words. You might actually be able to pick up on those things. In a subtle way over time, you might start to internalize a sense of this language, even though it's completely fabricated and doesn't really have any function.
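A rough sketch of that idea in Python. The syllable splitting here is deliberately naive (it just splits on whitespace), and the recording names are made up; a real system would need proper syllabification:

```python
import hashlib

def speak_syllables(text, recordings):
    """Map each text token to a recorded gibberish syllable.

    The same input token always hashes to the same recording,
    so repeated words repeat - "no no no" comes out as the same
    sound three times, which reads as real-language repetition.
    """
    out = []
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode()).hexdigest()
        index = int(digest, 16) % len(recordings)
        out.append(recordings[index])
    return out

recs = ["ba", "nats", "ustah", "brero", "di"]
print(speak_syllables("no no no", recs))  # the same recording, three times
```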

Machine Learning

There have been amazing advances in speech synthesis. Just recently there was a paper showing that your voice can be almost completely replicated with only five seconds of your recorded speech, which is scary and amazing. I think something along these lines could lend itself to a really interesting game implementation for procedurally generating speech. And I'm really excited to see something like that happen in the future. I think we've barely scratched the surface of what's possible for inventive language in games. It's a really interesting space.

Most of the gibberish voiceover that I've seen to this point in games has been largely based on aesthetic. And I think it would be interesting to see more investigation into ways of making these inventive languages have a sense of function, even if they don't really have any function. I'm excited to see what we come up with.

Q: Did chasing the fun help?

I think so. I think the fun is what really drove the process, it was the idea of doing something new personally for myself, something I'd never done before, but also seeing how we can take some of the touchstones for this kind of thing, like "The Sims" and "Animal Crossing" and do something different. Certainly, it can be a technical rabbit hole. You can get really deep with it. In some of my other experimentations I've found it can be really challenging to get good results. For instance, systems that are based on single syllables. That is a really hard thing to do, to combine single syllable sounds into speech that sounds pleasing and doesn't just sound like a robot. It's really hard without advanced technology. I think ultimately it was more of an artful process for me than an intellectual one. And in that manner, at times it was about picking the path of least resistance, picking the thing that was going to give us the most bang for our buck and not get us stuck in a technological trap.

Q: Did you consider applying these concepts to Hyper Light Drifter?

No, I don't think we thought about that. I'm sure it could have worked, but there was something nice about just having those glyphs, and they're also really challenging, I think, from a design perspective. Occasionally you'd also have these storyboards that were meant to convey information too ... that to me felt like a different thing. It's possible that we could have done something with sound. But I think it accomplished what it was trying to do without it.

Q: Can you talk more about your role as Audio Director on Reigns?

I was the first audio person brought onto the project, and so in that way it kind of fell upon me to make some suggestions about how to move forward. I think originally the expectation and the intention was that I would do the music. But as I got into it, I realized that I was a little bit more interested in the systems as opposed to creating content, and that I would open it up to some other people to get involved. I would just focus on supporting them and focusing on the systems. And so that's really where I started. I was interested in this voiceover system, and I was also interested in a music system that we built for this game, which is also a type of phrase based system. It used four part voice leading ... soprano, alto, tenor, bass. And those four parts were mapped to the four categories that you're trying to manage in the game: religion, the military, the people and the treasury. So each one of those voices is mapped to those in some fairly straightforward senses, like driving part volume. But the system also had other challenges, such as figuring out how to transition through the many different phases of the game.
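A minimal sketch of that kind of mapping in Python - the part-to-stat pairing and the linear volume curve here are illustrative assumptions, not the actual system:

```python
def part_volumes(stats):
    """Map the four kingdom stats (0..100) to four choir part volumes (0..1).

    The pairing below (soprano=religion, alto=military, tenor=people,
    bass=treasury) is an assumed assignment for illustration.
    """
    pairing = {"soprano": "religion", "alto": "military",
               "tenor": "people", "bass": "treasury"}
    return {part: max(0.0, min(1.0, stats[stat] / 100.0))
            for part, stat in pairing.items()}

print(part_volumes({"religion": 50, "military": 100, "people": 0, "treasury": 75}))
```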

Q: Does localized text generate different voiceover?

I'm actually not sure! That's something we should look into. It'd be interesting to see if it's different. That would be appropriate, I'd say.

Q: What did you learn as audio director?

It was a really good experience for me to be in that role, because it helped me to learn that I really don't like managing people at all. But I also learned that I really like getting involved with building systems. I find it really interesting and I've already done a lot of systems work since then. I've been working with Dino Polo Club out of New Zealand for the last six-odd years, working on Mini Metro and Mini Motorways, and have done lots of interesting sound work for those games. On Solar Ash, I've been doing a lot of systems work as well. I've always had a bit of an itch for technical problem solving and coming up with novel approaches to things. Between that and learning I dislike working in a managerial role, it was a great learning experience. Actually, at Heart Machine, that was a role that was mine in theory, if I wanted it. But we ended up bringing in somebody else to take on that role, because I didn't want to do it.

Q: How would you recommend applying this technique to an in-game radio channel?

It's hard to say. A radio program is a totally different form factor. When you don't have the context that you get from visuals, then you're really asking a lot from gibberish. I think as a result of that, the gibberish would be more effective if the context was really dialed in on the audio side. And for that, I think about "The Sims", because of the way that they use their language "Simlish". They tend to really dress it up in the context of what it's trying to do. For a radio show, you have all the bells and whistles that go along with that - the tone of the voices, the rhythm of the performances ... maybe there's music and little sound effects and things that all kind of contribute to that. It also depends on what your goals are with the gibberish. Are you trying to get people to actually understand the content, or is it more just about an impression and creating a feeling? I think all of those things really matter as far as how you would go about something. So I would keep those in mind. It's hard to give specific suggestions without knowing what the ultimate goals were, but that's what I would say.

Foreword: Fruits of the Desert

My colleague Noah Kellman asked me to write the foreword for his recently published book, 'The Game Music Handbook'.

In visual media, there is a tendency to treat music like polish. Hired late in the process, composers typically begin working on a project that's otherwise nearly complete. Our task is to put the contributions of the rest of the team in a more flattering light.

In video games, we have a chance to cultivate a different way. With the prevalence of small, flexible, independent teams, and a growing stable of smaller, more creatively lenient publishers, we as composers can take on a more involved role. We can exist more like artists and collaborators, and less like the last-minute afterthought, taken for granted. We can stop being outsiders and start to embed more deeply with our colleagues. We can create more opportunities to earn their trust and respect. We can play a critical role in shaping the trajectory of the games we make.

There are a million and one ways to structure music in games. Take 'The Secret of Monkey Island', where the score modulates effortlessly from song to song as you travel the Caribbean. Or perhaps 'Uurnog Uurnlimited', where music algorithms evolve alongside player choices that other games rarely acknowledge. The scale of what's possible continues to grow year over year. And within this space are ways you and I can contribute that sidestep typical expectations.

Doing your best work generally requires time and freedom beyond the norm. You need more of both — so that you have the space to maneuver out of false starts and dead ends. The more you intend to challenge yourself and dance along the edge of what is possible, the more flexibility you need. You can't go out to the fringes without the trust of your coworkers, and you can't make your way back in a beneficial manner without sufficient time. So, if you can, get involved early, and don't forget to take breaks. Deep work requires deep rest.

If you do find yourself on board at the outset: Congratulations! Welcome to the desert. At this stage, everything is uncertain, and anything is possible. In this unfamiliar territory, you may not always have a 'feel' for the trajectory of a project, and your colleagues may not either. There can be anguish in trying to write music for a game that has yet to figure itself out. You may find it fruitless developing a musical style in such foggy surroundings. It can, therefore, be beneficial to put off the content aspect of production until you have a better sense of the scope, scale, and essence of the project. For the moment, research & development may be the best use of your time.

I found myself in this position during the development of Solar Ash. It took years to solidify the identity and mechanics at the core of the game. If I had spent much of that time making music demos, trying to hit a moving target, it would have been rather inefficient. It proved more fruitful to focus on weird, novel, and spontaneous ideas, workflow tools, and audio systems.

Solar Ash was the first fully 3D game I ever worked on, and I treated the experience like being in an educational sandbox. Learning about tangential disciplines has a way of stoking the mind, and I learned a whole lot of math. I challenged myself to reinvent wheels, building things I'd seen other people do. I developed a tool for doing doppler effects (how the pitch of a sound changes as you move quickly). I also built a dynamic reflection system, something I'd seen in AAA games, like Overwatch. But I tried to find new ways to use them. You could use dynamic reflections as a stylistic effect instead of a realistic one. What if doppler pitch effects were used not just to alter the pitch of moving objects, but music?

I became singularly focused on building out systems that could react in dynamic ways as you move through three dimensions. Being so engrossed in a new context led me towards ideas for systems I'd never seen before as well. One example is a forest of trees that sing to you as you move through them. Your angle and velocity relative to each tree yields a different musical note. By diving so deep into sound propagation and vector math, I better understood the challenges and possibilities of this new paradigm, the 3D environment. And I came out the other side with ideas I would never have come up with otherwise, and a newfound ability to implement them.

When you work in a more generalized way, you're more likely to create something with multiple use cases down the road. I've already reapplied much of the new math I learned while working on Solar Ash. When I need a way to modulate a sound based on its positional relationship to other objects, for instance, I can take the dot product of two unit vectors. Whether you're personally familiar with these examples is unimportant. With time and space to play, you can collect an assortment of tricks, forever available to you and your peers (if you're kind), for the benefit of the game, and beyond.
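That dot product trick is simple to sketch (a minimal example; the function and parameter names are mine):

```python
import math

def facing_amount(listener_forward, to_source):
    """Dot product of two unit vectors: 1.0 when the listener faces
    the source directly, 0.0 at 90 degrees, -1.0 facing directly away.
    The result makes a handy modulation input after remapping to 0..1."""
    def normalize(v):
        length = math.sqrt(sum(c * c for c in v))
        return tuple(c / length for c in v)
    a, b = normalize(listener_forward), normalize(to_source)
    return sum(x * y for x, y in zip(a, b))

# Facing straight at the source:
print(facing_amount((0, 0, 1), (0, 0, 5)))  # 1.0
```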

There are ways to contribute besides music, though it can be challenging at times. You may be working with subject matter, people, or genres unfamiliar to you. I wouldn't have a great idea about how to contribute to the design of a fighting game, for instance, beyond giving feedback about whether it feels good or not to play. There are times when it's not a natural fit to chime in beyond musical boundaries, or perhaps you don't feel like it, and that's okay. Still, there may come a time when your unique background, skill set, and perspective afford you insights no-one else has. This pre-production period is a great time to interface with new ideas outside of your comfort zone. When the pressure to deliver content isn't there yet, you have the opportunity to pick up new skills and learn from your colleagues.

We all have capacities that extend beyond our outward specialty. I used to build websites and I love making logos. The sound designer I currently work with has also directed game projects. You might have an affinity for literature. You have more to offer than what it says on your business card or website. Getting involved early allows others to know you better, and gives you a chance to put more of your personality into the game. You get to be a part of the prototyping phase, whether you're exploring ideas in a musical silo or iterating on cross-disciplinary concepts with other members of your team. You, as much as anyone else on your team, can theoretically steer the direction of the project. Sometimes even the smallest contributions, ideas, and suggestions can have an outsized impact.

Over time, there is a cumulative effect to all of this novelty exploration. You start to fill up your bag with hard-earned tricks, lessons from success and failure, and curiosities worth a look down the line. Some of these will be unique to games, while others may overlap with other musical forms, like film or theatre. It could even go beyond into other mediums, like poetry or painting, should you ever find yourself there. For instance, in my bag, I know I can effectively create procedural sequences and iterations on many things (music, writing, etc.) using Markov chains. Or I can shift any note in a diminished chord down a half step to get to a dominant chord (and eventually to other keys). I know that a mono reverb in the middle of a signal chain can help to fill out a sound, which for me cross-pollinates with the way I use layers and effects in Photoshop. And by giving musical elements different bar lengths, one can create musical variations that go on without repeating for years.
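The Markov chain trick from my bag can be sketched in a few lines (a toy example over note names; the melody and seed values are made up):

```python
import random
from collections import defaultdict

def build_chain(sequence):
    """Map each element to every element that ever followed it."""
    chain = defaultdict(list)
    for current, nxt in zip(sequence, sequence[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length, seed=0):
    """Walk the chain, picking each next element from observed followers."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return out

melody = ["C", "E", "G", "E", "C", "G", "A", "G"]
chain = build_chain(melody)
print(generate(chain, "C", 8, seed=42))
```

The same structure works on words or any other sequence, which is what makes it such a reusable trick.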

The resourcefulness you accumulate in your creative travels may allow you to sidestep a mental block. I bring back old ideas from the dead all the time. The scores I've written for successful games like FEZ and Hyper Light Drifter contain tons of old, repurposed ideas. But I think we benefit from first stretching ourselves in those uncomfortable moments, continuing our search for new territory. There are always new hurdles to cross, and if you want to do your best work, prior experience alone won't leap those bounds for you.

Early on in my career, I worried that sharing ideas and discoveries freely with others would dilute my uniqueness, jeopardizing my chance at success. But we all benefit tremendously from the collective knowledge accumulated by those that came before us. And even with an open spirit, sharing what you know with others, you will still have a bag of tricks wholly unique to you. There are some things we can't effectively externalize, and we all see things a different shade. There's a little bit of the specific, a little bit of the broad, and a whole lot of you in every creative experience. Take some time to look through your bag, reflecting on what you've learned. Give your brain the time and space to make new cross-connections. It can help you make better sense of your work, your colleagues, and the broader creative world around you.

The groundwork for the 'tried and true' approaches to video game music has been laid down and built upon for decades at this point. And you will find this book to be an excellent primer and guide to applying those techniques and concepts. But many exciting possibilities lie beyond the well-worn path, in the endless desert of creative hypotheses. If you can, spend a few weeks there, trying out novel ideas: Explore an implementation approach that you've never heard before. Follow a silly thought experiment down a rabbit hole. Flip your usual strategies upside down. If you can stomach the struggle, the benefits you return with can be extraordinary.

Sure, you will sometimes find you've unintentionally reinvented the wheel. Or learn you've failed at something someone else already discovered didn't work. But none of this exploration truly goes to waste. If you do uncover a gem, it may not suit the game. And trust me, when you come back from the desert, you may return with some inappropriate ideas! But you can always store them in your back pocket, for use at a later date. If your time was not overtly fruitful (to be fair, the desert doesn't have a lot of fruit trees!), you've at least learned something. And the more you wander, the more you'll learn the cost of your creative choices. You'll be humbled by what you could never accomplish without help. And you'll wise up to hairy problems that are hard to pull off even once.

Working on video games is not always fun and games (go figure). But creatively speaking, it has put me in some of the most compelling and confounding circumstances I've ever come across. So I guess what I'm saying is: don't be afraid to step into an oddly original landscape. You'll inevitably find yourself in a predicament anyway. So you might as well have a say about your surroundings.

Take this invaluable book, or a book like it, with you on your travels. You may enjoy breaking the rules more if you know what they are.