Creating A Dancing Character With Machine Learning
In the last several years we’ve seen many incredible ways that machine learning (ML) has been applied to solving fascinating, hard problems out in the world. From predicting protein folding structures and novel drug molecule candidates to creating generally capable chatbots for work or play, ML has demonstrated its utility in many aspects of modern life. ML is also showing awesome potential to revolutionize the games industry, both in the features we can now add to our games as well as in the game production process itself. Resolution Games started exploring how ML could impact their games late in 2021 by hiring me to start their ML team.
I came to Resolution from academia, having done a PhD at Georgia Tech studying computational creativity, ML, and improvised movement with virtual characters in XR. I then held a post-doctoral research position with Microsoft Research, looking into some of the problems that game developers might face when trying to use ML techniques like reinforcement learning in their industrial pipelines. Working at Resolution gave me the opportunity to learn and apply cutting-edge ML techniques to challenging problems in the game industry, with real-world impact on a global audience.
ML was brand new to the company and we needed the time and space to build up our ML development pipelines and processes, while still adding value to a concrete project. We were fortunate enough to join a smaller, more experimental first project - Fish Under Our Feet (Fish) - that helped us with exactly these requirements. Fish was planned as a pathbreaking showcase for Meta’s new (at the time) passthrough API for developers to use with the Quest 2 headset. Through the bite-size experiences in the demo, the Fish team demonstrated clever ways to use hand interactions, voice interactions, spatial anchors for object persistence across sessions, room-mapping for aligning the game with a player’s room/furniture, and much more to build an exciting vision of how passthrough augmented reality (AR) could inspire future players within the constraints of the hardware platform.
For one of the experiences in the game, the Fish team wanted a psychedelic, sci-fi AR dance party where you could dance with an extra-dimensional robot character. Given the potential for ML to add an improvisational flavor to the dancing character, and my previous work on improvisational dance and AI systems, this seemed like a great opportunity to contribute. For the ML part of the project, we decided to focus on real-time synthesis of the character’s dance movements in response to the player’s dancing. We aimed to create a responsive character that could both lead and follow the player’s dancing: establishing a relationship with the player by imitating their movements, playing with that relationship by generating movements connected to theirs, and at times breaking that relationship by improvising completely novel dance moves to inspire the player. Read on to find out about our experiments creating the dancer for Fish Under Our Feet and the many lessons we learned along the way.
What is Machine Learning?
In order to understand how we built the real-time motion synthesis powering the dancer character in Fish Under Our Feet, we need to understand a bit about machine learning. Machine learning is a set of techniques for using data to create a computational model that solves some task or improves its performance on that task. For example, we might use ML to learn a model that recognises cats and dogs in photos that people upload to a photo-sharing service, so that pet lovers can find, view, and comment on adorable pets. The data used depends on what problem the model is trying to solve. In this example, it could be a collection of images of dogs and cats, each labeled to say whether it features a cat, a dog, both, or neither. The model is then trained on this dataset of images and labels so that it can predict whether other images it sees in the future contain cats, dogs, neither, or both. This is called supervised learning, since the model’s (potentially incorrect) predictions during training are supervised by the correct answers it should have provided, like the cat/dog/neither/both labels in this example.
Machine learning is a subset of the broader field of AI techniques. Within ML itself, many of the recent results that have driven so much excitement have been accomplished using deep learning, that is, machine learning with deep neural networks. However, to understand them, we must first talk about artificial neural networks in general. Neural networks are very coarse approximations or high-level analogues of biological neural networks. In a simplified sense, they consist of numerical matrices, called weights, that are multiplied together in sequence and transformed through non-linear mathematical functions to generate numerical outputs. By adjusting the weights systematically using training methods like supervised learning, neural networks can be made to generate useful outputs given some relevant input. For example, a neural network can be trained using supervised learning to recognise whether an image contains dogs or cats by feeding the network the image as a set of pixel values and reading off a numerical output corresponding to the prediction cat, dog, neither, or both.
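To make this concrete, here is a minimal sketch of supervised training for a tiny neural network classifier. It is purely illustrative: the framework (PyTorch), the layer sizes, the image size, and the random stand-in data are all assumptions for the example, not anything from our actual pipeline.

```python
# Minimal sketch (illustration only): a tiny feedforward network trained
# with supervised learning to classify images as cat, dog, both, or neither.
import torch
import torch.nn as nn

# Weights are just matrices; each Linear layer multiplies its input by a
# weight matrix, and ReLU applies a non-linear transformation.
model = nn.Sequential(
    nn.Flatten(),                  # image pixels -> one long vector
    nn.Linear(64 * 64 * 3, 128),
    nn.ReLU(),
    nn.Linear(128, 4),             # 4 outputs: cat, dog, both, neither
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One supervised training step on a (hypothetical) batch of labelled images.
images = torch.randn(32, 3, 64, 64)      # stand-in for real photos
labels = torch.randint(0, 4, (32,))      # stand-in for human-provided labels

predictions = model(images)              # the network's (possibly wrong) guesses
loss = loss_fn(predictions, labels)      # penalty for deviating from the labels
optimizer.zero_grad()
loss.backward()                          # compute how to adjust every weight
optimizer.step()                         # nudge the weights to reduce the loss
```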
Deep neural networks are basically longer (or deeper) sequences of neural network units or blocks that have been designed to take advantage of the vastly increased amounts of data that have been collected about interesting problems and the vastly increased parallel computing power that has become available to crunch all that data. New ML architectures designed for deeper networks have made it possible to better utilize the advances in data and compute. These include Convolutional Neural Networks (CNNs) that work well on visual data as well as Recurrent Neural Networks (RNNs) and Transformers that work well on sequential data.
The deep learning revolution has also exploded applications in generative models, where the ML model tries to learn the probability distribution from which input samples (like images of cats in a dataset) are drawn so that it can generate new samples from that distribution and thus create new images of cats that don’t exist. These generative models use techniques like Variational AutoEncoders (VAEs), Generative Adversarial Networks (GANs), or Diffusion models among others. We used many of these advances to create our dancing character.
Building An Improvisational Dancer With Machine Learning
We built the AI dancer character in Fish Under our Feet using deep learning and generative models. The design and implementation of the character presented us with many challenging problems that we tried to solve in interesting ways. Ultimately, we learned a lot throughout the process and we’re excited to share what we’ve learned.
A Unified Model For Imitation, Transformation, and Novel Generation Of Dance Movement
The AI character’s desired capabilities on the dance floor posed an interesting set of challenges. The character had to be able to imitate the player’s dancing closely, generate variations of their dance moves, and generate entirely new dance moves when required. All of this had to be improvised in real time, responding appropriately to absolutely any kind of dancing a player could possibly do.
We created a unified model architecture for all of these character capabilities based on Convolutional Neural Networks (CNNs) organized into a Variational AutoEncoder (VAE). A VAE consists of two parts, an encoder and a decoder, which during training first compress an input into a lower-dimensional representation using the encoder, and then decompress this lossy compressed form back into the original input using the decoder. The model is trained with an objective that penalizes the decoder's recreation for deviating from the original input, while also helping the model learn to encode similar high-dimensional inputs close together in its compressed low-dimensional space. So, for example, similar images of cats in the original, high-dimensional pixel space of cat images would come to lie close together in the compressed low-dimensional latent space of the model. Once the encoder and decoder that make up the VAE are trained and the model is deployed in a final application, the encoder is usually ignored: random points are sampled from the model's low-dimensional latent space and passed through the decoder to generate new creations in the high-dimensional input space.
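As a rough illustration of this idea, here is a minimal pose VAE sketch in PyTorch. It is not the production model: it uses simple fully connected layers instead of the convolutional blocks mentioned above, and the joint count, latent size, and loss weighting are assumptions chosen only to keep the example short.

```python
# Rough sketch of a pose VAE (illustrative only; the shipped model's
# architecture, sizes, and loss weighting are not reproduced here).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_JOINTS = 22            # 3D joint positions per pose (assumption)
POSE_DIM = NUM_JOINTS * 3
LATENT_DIM = 16            # size of the compressed latent space (assumption)

class PoseVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, LATENT_DIM)
        self.to_logvar = nn.Linear(128, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, POSE_DIM)
        )

    def forward(self, pose):
        h = self.encoder(pose)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation: sample a latent point near the pose's encoding.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(reconstruction, pose, mu, logvar):
    # Reconstruction term: penalise deviation from the original pose.
    recon = F.mse_loss(reconstruction, pose)
    # KL term: keep the latent space well-behaved so that similar poses
    # end up close together and random samples decode into sensible poses.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```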
We used a VAE, as explained above, to learn a representation of dance-related character poses from dance mocap data sets. Player dance moves could be encoded as trajectories in the VAE’s latent space of poses, and these latent trajectories could then be decoded into a coherent imitation of the player’s movement, albeit interpreted through the lens of the ML model’s weights. At the same time, generating arbitrary, continuous trajectories in the latent space would, once decoded, create smooth, completely novel dance movements. In addition, the player’s dance moves, encoded as trajectories of poses in the model’s latent space, could be manipulated to create interesting variations of the player’s dance movements in real-time.
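Continuing the hypothetical PoseVAE sketch from above, the imitation and transformation ideas might look something like the following. The function name and the way the perturbation is applied are illustrative assumptions, not the shipped logic.

```python
# Sketch of the imitation / transformation idea, reusing the hypothetical
# PoseVAE defined above. None of these names come from the shipped code.
import torch

vae = PoseVAE()   # stands in for a trained pose VAE

def imitate(player_poses, variation=0.0):
    """Encode a sequence of player poses into a latent trajectory,
    optionally perturb it, and decode it back into character poses."""
    with torch.no_grad():
        h = vae.encoder(player_poses)          # (T, 128)
        trajectory = vae.to_mu(h)              # (T, LATENT_DIM) latent trajectory
        # Transformation: nudge the trajectory to create a variation of the
        # player's movement rather than an exact copy (variation=0 imitates).
        trajectory = trajectory + variation * torch.randn_like(trajectory)
        return vae.decoder(trajectory)         # (T, POSE_DIM) character poses
```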
The resulting pose-generation model formed the basis for the AI character to be able to dance with people. The character could dance and respond to the player’s dancing in real-time with imitation or novel movements. However, it immediately brought up some additional challenges that needed to be solved to make this approach viable.
Acquiring Enough Data For Deep Learning
The first, and most fundamental, problem we came across was how to acquire enough data to successfully train data-hungry deep learning models for the work. Through careful research we found that no existing datasets or data sources of motion-captured human movement, in the form of three-dimensional skeletal joint data, had broad enough and deep enough coverage of dance movements. Plenty of motion capture datasets existed for limited activities like walking, running, and jumping. However, these did not include dance movement in any depth or range.
Luckily, we also found that several permissively licensed video datasets existed with a reasonably large number of video clips showing dancers and dancing. We then investigated computer vision models from several state-of-the-art research papers and codebases that could potentially convert the video data of dancers into 3D skeletal joint data, i.e. usable mocap data. We tried several on a subset of the full dataset we had curated and, based on criteria like extraction speed, accuracy, platform support, hardware availability, and alignment with our use case, we settled on the MediaPipe Pose model based on BlazePose.
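To give a flavour of this step, here is a small sketch of extracting 3D landmarks from a video clip with OpenCV and the MediaPipe Pose solution. The file name and model settings are placeholders, and the mapping from MediaPipe's 33 landmarks down to our own skeleton is omitted.

```python
# Sketch of video-to-skeleton extraction with MediaPipe Pose (placeholders only).
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
cap = cv2.VideoCapture("dance_clip.mp4")   # placeholder path to one video clip

clip_poses = []
with mp_pose.Pose(static_image_mode=False, model_complexity=1) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV decodes frames as BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_world_landmarks:
            # 3D landmarks in metres, relative to the hips.
            clip_poses.append(
                [(lm.x, lm.y, lm.z) for lm in results.pose_world_landmarks.landmark]
            )
cap.release()
# clip_poses now holds one skeletal pose per tracked frame of the clip.
```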
We created a manual data processing pipeline in Python to run the video files through this model and generate clips of skeletal mocap data corresponding to each clip of video data. We then converted this skeletal data into NumPy array datasets for ML training, stored in HDF5. We experimented with several strategies for creating these datasets, querying them at training time, and structuring them for more efficient learning. Ultimately, we settled on a format where a single HDF5 dataset stored all the clips of skeletal joint data, with each clip consisting of an array of poses for a single dancer. During training, the poses in these clips (which together signify motion over time) were sampled as individual poses.
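A minimal sketch of one possible storage layout with h5py is shown below. The dataset names and exact structure we actually shipped are not reproduced here, and extracted_clips is a stand-in for the output of the previous extraction step.

```python
# Sketch of storing extracted clips in HDF5 and sampling poses at training time.
import numpy as np
import h5py

# Stand-in for real extracted clips: each clip is (num_frames, num_joints, 3).
extracted_clips = [np.random.rand(120, 33, 3).astype(np.float32) for _ in range(3)]

all_frames = np.concatenate(extracted_clips, axis=0)
clip_lengths = np.array([len(c) for c in extracted_clips])

with h5py.File("dance_poses.h5", "w") as f:
    f.create_dataset("poses", data=all_frames, compression="gzip")
    f.create_dataset("clip_lengths", data=clip_lengths)  # lets clips be recovered

# At training time, individual poses can be sampled directly from the pool.
with h5py.File("dance_poses.h5", "r") as f:
    idx = np.random.randint(f["poses"].shape[0])
    pose = f["poses"][idx].reshape(-1)    # one flattened skeletal pose
```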
Generating Character Root Motion
We wanted the learned pose representation to be useful for generating character poses at any root position, which would make the model more general and reusable at different stages of a dance move or with different directions of movement in space. So the model we created generated poses with the character’s root at the origin and all the character’s joints relative to that root origin. This decision, however, had the unfortunate side effect of removing any root motion for the character. The model was able to generate smoothly varying poses from its latent space and showcase the resulting dance moves. However, because the character did not move its root, it looked like a marionette floating in place.
Since we wanted the model to generate novel motion that had never been seen before, it made sense to address the character’s lack of root motion with an ML solution, which could potentially generalize better than a hand-crafted one. We created an ML model to predict the best root positions for the generated poses given the recent motion history of the character’s joints. Essentially, the model would guess the best position for the root joint of the character, given the previous positions of its limbs and the root.
This worked out relatively well using a cheap feedforward network that ran on CPU every game tick. You can see the results in video 4. We tried many different model architectures, including those with RNN memory, but the simpler model we eventually used had the best combination of resource-efficiency, speed, and accuracy. With more and better training data, the prediction performance could be improved in the future. We also tried predicting the relative change in root position (the root motion) instead of the absolute root position to make the predictions more generalizable. However, the relative root motion prediction suffered from compounding drift over time, and the absolute root position prediction proved more accurate overall.
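For illustration, a root predictor along these lines might look like the sketch below. The history length, layer sizes, and input layout are assumptions; only the overall idea (a small feedforward network predicting an absolute root position from recent joint history) matches what we described above.

```python
# Sketch of a root-position predictor: a small feedforward network that
# guesses the absolute root position from a short history of joint positions.
# Sizes and history length are assumptions, not the shipped values.
import torch
import torch.nn as nn

NUM_JOINTS = 22
HISTORY = 10                          # frames of recent motion history (assumption)
INPUT_DIM = HISTORY * NUM_JOINTS * 3  # would also include past root positions in practice

root_predictor = nn.Sequential(
    nn.Linear(INPUT_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 3),                 # predicted absolute root position (x, y, z)
)

# Every game tick: flatten the recent root-relative poses and predict the root.
recent_poses = torch.randn(1, HISTORY, NUM_JOINTS, 3)   # stand-in for real history
root_position = root_predictor(recent_poses.flatten(start_dim=1))
```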
Imitating Full-Body Player Motion With Only Head And Hands
Now that we had real-time pose generation with reasonable root motion prediction, it was time to use the model to power the character to do everything from imitation to novel dance move generation. It quickly became apparent that imitating a player in VR would be difficult: the VAE encoder needed a full set of 22 3D skeletal joint positions per pose in order to imitate input dance moves, but the VR hardware only provided three joints of input from the player, i.e. the player’s head and two hands. Passing in a full-body frame with only 3 of the 22 joints populated with data (and the rest zeroed out) resulted in poses that were completely unrealistic, inhuman-looking, and crumpled up like a ball.
Since we had a good encoder for a 22-joint input, we figured we would try to reuse it as much as possible and distill its knowledge into a new encoder. So we trained a new encoder using supervised learning on a modified version of the original pose-generation training data set: for each body pose, all joints except the head and two hand joints were removed, and the labeled or ground truth output was the latent space point from the original encoder as if it had received all 22 joints for that body pose. Thus the new encoder was trained to imitate the full-body encoder’s output and encode similar points in its latent space, but with only the three input joints that the VR headset could provide us.
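A distillation step along these lines might look like the following sketch, reusing the hypothetical PoseVAE from earlier. The network shape, loss, and optimiser are illustrative assumptions rather than the training setup we actually used.

```python
# Sketch of distilling the full-body encoder into a head-and-hands encoder.
# The trained full-body VAE encoder provides the latent targets; the new
# encoder only sees 3 joints. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

THREE_JOINT_DIM = 3 * 3     # head + two hands, 3D positions each
vae = PoseVAE()             # stands in for the trained full-body VAE from earlier

three_joint_encoder = nn.Sequential(
    nn.Linear(THREE_JOINT_DIM, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM)
)
optimizer = torch.optim.Adam(three_joint_encoder.parameters(), lr=1e-3)

def distillation_step(full_poses, head_and_hands):
    # Ground truth: where the full-body encoder places this pose in latent space.
    with torch.no_grad():
        target = vae.to_mu(vae.encoder(full_poses))
    prediction = three_joint_encoder(head_and_hands)
    loss = F.mse_loss(prediction, target)    # imitate the full-body encoding
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```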
We swapped this new encoder into the pose-generation architecture and saw decent results at predicting the full-body latent space coordinate for a pose given only the head and hand positions. Predictably, the model did well on poses where there was less ambiguity about what the feet were doing given the hand and head positions, and less well on poses where the feet could be in any number of possible positions. The model generally erred on the side of predicting lower-body poses that were more common in the data set, like standing, rather than, for example, kicking. You can see the performance of this machine learning predictive IK system for the lower body in video 5, where we compare the full-body recording played from file, our imitation of that recording given all 22 joints, our imitation of that recording with only 3 joints, and the original encoder given only 3 joints with non-zero data.
An alternative approach would have been to use an off-the-shelf inverse kinematics (IK) rig for VR systems, like FinalIK, and feed the predicted positions of the player’s body generated by that IK rig as inputs to the original encoder. However, this would have been an additional burden on our VR system’s resources, especially when tracking 22 points on the player’s body rather than the usual limited number of joints. From the comparison above, you can see that the performance we got from the retrained encoder is about on par with what can be expected from a general-purpose IK rig. Additionally, using an encoder trained on dance data allowed us to potentially predict dance-related lower-body poses instead of the general standing-related poses that an IK rig would have computed, at least some of the time.
Lessons Learned
The biggest lesson we took away from the experience of creating an improvisational dancing character for the passthrough tech demo, Fish Under Our Feet, was to focus more on getting higher quality data in the future. We used an innovative approach to source our data, extracting it automatically from video data sets of people dancing; in the year and a half since this work was done, this approach has become a commercial service offered by several start-ups. Our approach to data acquisition provided a good compromise between the scale of the data and the cost of collecting it from higher quality sources such as motion capture. However, there were still several problems with the extracted data, like foot sliding and root/joint position noise in the Z axis (the depth of joints, which is hard to estimate from monocular video/images). These artifacts were then exaggerated when the root motion prediction model moved the character root around as it was generating pose sequences. Since this work was done, newer models for extracting joint position and rotation data from monocular video footage have also become available, with improved accuracy and resource efficiency; these would be helpful for improving the system in the future.
Another important lesson we learned was that joint rotations were as important to have in the training data as joint positions. The lack of joint rotations extracted by the ML models from the video data made it really hard to use a standard rigged mesh for visualizing the dancing character. So our talented 3D and VFX artists had to work hard to balance the visual fidelity and aesthetics of a procedurally generated character driven by particle effects with a translucent mesh that could give more clarity and definition to the agent’s limbs while dancing. Their solution also had to hold up even when the generated motion placed the agent’s limbs in realistic positions but rotated them completely unrealistically.
The ML model we created that allowed us to procedurally generate dance moves was always designed to be part of a larger architecture that would control what the model was generating. This was central to our decision to use a VAE for pose generation rather than a more autonomous solution like an RNN generating pose sequences directly. The larger architecture could generate random loops of dance movements from the pose generation model’s latent space using a natural-spline-based looping curve generation algorithm. It also had a simple turn-taking and improvisation system that used a measure of player activity and stillness to smoothly alternate between generating novel dance moves when the player was still and producing modified imitations of the player’s movements when the player was dancing actively. However, this simplicity did make it harder for the player to understand how the agent works and to really appreciate the agent’s capacity for dance imitation, transformation, and generation. Adding a more explicit turn-taking system would have allowed the player to see their dance moves reflected in the character’s actions, or conversely, understand how novel the generated movements were.
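To illustrate the looping-curve idea, here is a small sketch that threads a periodic cubic spline through random keyframes in the latent space of the hypothetical PoseVAE from earlier and decodes the result into a dance loop. The keyframe count, frame count, and use of SciPy are assumptions, not the algorithm we shipped.

```python
# Sketch of generating a looping curve through the latent space with a
# periodic cubic spline, then decoding it into a novel dance loop.
import numpy as np
import torch
from scipy.interpolate import CubicSpline

NUM_KEYPOINTS = 6          # random latent keyframes for the loop (assumption)
FRAMES_PER_LOOP = 120

vae = PoseVAE()            # stands in for the trained pose VAE from earlier

keys = np.random.randn(NUM_KEYPOINTS, LATENT_DIM)
keys = np.vstack([keys, keys[:1]])                  # repeat the first key to close the loop
t_keys = np.linspace(0.0, 1.0, NUM_KEYPOINTS + 1)

spline = CubicSpline(t_keys, keys, bc_type="periodic")   # smooth, seamless loop
t = np.linspace(0.0, 1.0, FRAMES_PER_LOOP, endpoint=False)
trajectory = torch.as_tensor(spline(t), dtype=torch.float32)

with torch.no_grad():
    dance_loop = vae.decoder(trajectory)            # (FRAMES_PER_LOOP, POSE_DIM) poses
```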
We had also planned for a stretch goal of training a sequence generation model like an RNN to operate in the agent’s latent space, trained on different dance styles or sequences as encoded by the VAE. This would have been the next step, replacing the random spline-based loop generation in the model’s latent space that the released model used to generate sequences of novel dance poses. It would possibly have restricted the pose generation to dance sequences that made more sense stylistically and physically, for example, preventing double-jump moves in mid-air. The slightly alien dance move generation did fit the character’s ethereal, extra-dimensional background and aesthetic, but it would be great to be able to use this in experiences where the generation needs to be less creative and more realistic.
We could also have used an IK solver in combination with the ML-driven system to reduce foot sliding and other generation artifacts on the lower body and improve the realism of the generated movements. However, that would have required carefully balancing the realism of the generated motion against the perceived creativity of the model. We could also have used traditional game AI techniques like NavMeshes to have the character navigate around obstacles in the scene with the player, and dance on tables or at different locations in the room with the player. Hopefully future versions of this work can use these ideas to make the passthrough AR aspects of the dancing character even more immersive and realistic.
Overall, this was a really fun way to introduce ML to the VR and AR games we make at Resolution Games. While it would have been lovely to have a few more months to refine and polish the system’s outputs (isn’t that always the case?), it is personally gratifying that we managed to create the initial framework for a truly improvisational dancer using ML. Our system enabled the character to go beyond pure mimicry or de novo generation to something that could move fluidly between these two extremes, simultaneously incorporating player dancing and originality into the procedurally generated experience. We learned a lot about every step of the process needed to create this dancing character using ML and ship ML technology in an actual game experience. To the best of our knowledge this is one of the first public prototypes that uses deep learning and generative models to synthesize character movements on-device in real-time on a standalone VR/AR device. We’re very excited to see what’s next.
If you would like to see more of what our ML team has been up to since this work, stay tuned for news about our upcoming game Racket Club.