Overview: Video Motion Capture Solutions

Although motion capture is now the industry standard for animating humanoid characters, it is often too expensive for smaller companies. We regularly hear smaller studios in the game development industry ask: couldn't we just do MOCAP with a camera or webcam, using AI? The footage we have all seen of AI recognizing people in video looks promising: on top of the image, a skeleton is drawn that matches the location and pose of each person. In this overview we look at the state of the art of MOCAP systems that only use a webcam or video camera, discuss how much room for improvement there is, and consider whether it is worth it for you.

Existing Software/Services

We first look at some ready-made solutions that you may have already come across on social media. These services promise a lot, and their showcase videos often look impressive. In this blog post we put these packages to the test and see what they are capable of.

DeepMotion

A first solution is DeepMotion. They advertise heavily on social media, and their promotional videos look quite promising. DeepMotion costs around €147 per month, for which you can convert 2 hours of footage into animations; if you need less than 2 hours, they also offer cheaper subscriptions. DeepMotion lists a number of requirements for best results:

• The camera should remain stationary and parallel to the actors

• The entire body must be visible

• As much contrast as possible between the actor and the background

• No loose clothing over knees / elbows

• No occlusions

These requirements already limit what the service can do. We tested it for you, deliberately using clips that violate some of the requirements above; the aim is to push the limits of the technology. For a more standard example that meets their requirements, watch the clip above.

Even under ideal conditions, DeepMotion's results are generally not very stable and quite "jittery". What is particularly noticeable is that the service struggles with the 3D aspect of a scene, which makes sense: the input video is only a 2D projection of the motion. As a result, the animation looks correct from the camera's perspective but is not necessarily accurate in the third dimension. One consequence is that the feet do not stay planted on the ground but float around. In the dance video, DeepMotion is confused by the dancer's plaid shirt, and in the parkour video it has trouble when the actor is temporarily partially occluded by the swing. Ultimately, the results are not bad, but the usability of the service is quite limited due to the many requirements for good results. With some post-processing, the results can be usable for specific use cases.
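As an illustration of the kind of post-processing we mean, below is a minimal sketch (not DeepMotion's actual pipeline, and the thresholds are illustrative) that pins a foot joint to the ground whenever it is close to the floor and nearly stationary. A crude "foot lock" like this removes the worst of the floating-feet problem.

```python
import numpy as np

def lock_feet(foot_pos, ground_y=0.0, height_eps=0.03, speed_eps=0.10, fps=30):
    """Crude foot-lock: whenever a foot joint is close to the floor and nearly
    stationary, pin it to the ground plane and freeze its horizontal position.

    foot_pos : (num_frames, 3) array of one foot joint's positions (x, y, z),
               y pointing up, in metres. All thresholds are illustrative only.
    """
    out = foot_pos.copy()
    dt = 1.0 / fps
    for f in range(1, len(out)):
        speed = np.linalg.norm(out[f] - out[f - 1]) / dt       # m/s
        near_ground = (out[f, 1] - ground_y) < height_eps       # within 3 cm of floor
        if near_ground and speed < speed_eps:                    # slower than 10 cm/s
            out[f, 0], out[f, 2] = out[f - 1, 0], out[f - 1, 2]  # freeze x/z
            out[f, 1] = ground_y                                  # clamp to the ground
    return out
```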

Radical

Another service you may have heard of is Radical. Below you can see the video they use to promote their service. Radical is cheaper than DeepMotion at €64 per month, for an unlimited number of minutes. Radical also lists some "best practices", which are virtually the same as DeepMotion's:

• The camera should remain stationary and parallel to the actors

• The entire body must be visible

• As much contrast as possible between the actor and the background

• No loose clothing over knees / elbows

Below you can see the results from one of their preset videos. Here too we tested the service for you, using the same clips as with DeepMotion; the results can be found below.

The difference with DeepMotion is striking. The jitter that was continuously present with DeepMotion has almost completely disappeared: the results are very stable and consistent. There is a trade-off, though: the animations lack a lot of the detail that was present in the original videos. Radical applies a kind of "smoothing filter" to the results, which makes them rather wooden and inaccurate. Radical also struggles with the 3D component of the scene, and the characters' feet slide back and forth on the ground. The parkour video worked reasonably well for DeepMotion and could at least serve as a starting point; Radical's results, on the other hand, are simply not usable. This is obviously an extreme case, but it gives an idea of how robust their system is.
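Radical's "wooden" look is consistent with heavy temporal smoothing. We don't know their actual implementation, but here is a minimal sketch of what such a filter does: an exponential moving average over the joint positions. The stronger the smoothing, the less jitter, but also the less high-frequency detail survives.

```python
import numpy as np

def smooth_joints(positions, alpha=0.2):
    """Exponential moving average over a joint trajectory.

    positions : (num_frames, num_joints, 3) array of joint positions.
    alpha     : blend factor in (0, 1]; smaller = smoother, but more detail lost.
    """
    out = positions.astype(float).copy()
    for f in range(1, len(out)):
        out[f] = alpha * positions[f] + (1.0 - alpha) * out[f - 1]
    return out

# With alpha around 0.2 most jitter disappears, but sharp motions (a quick hand
# flick, the landing after a jump) get flattened out: exactly the trade-off we
# see between Radical's and DeepMotion's results.
```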

Other Implementations

We now briefly look at some other existing video MOCAP solutions, both standalone papers and solutions that are integrated into other applications. The aim is to form a more complete picture of the State of the Art (SOTA) of this technology.

Wrnch

NVIDIA recently released Omniverse Machinima in open beta. What interests us is the MOCAP extension used in it: the wrnch AI Pose Estimator. You can see wrnch's results below, and we see the same problems here too: unstable feet and serious problems with the 3D aspect of the scene.

XNect

We keep seeing the same limitations in papers from recent years. An example can be found below.

OpenPose

OpenPose is an open-source (and therefore free) library for real-time body, foot, hand and face keypoint detection, and it is used for all kinds of applications. We briefly review the accuracy of this system specifically for our motion capture use case, which requires much higher accuracy than most other applications. This interesting article ran some tests with OpenPose using different cameras, with two resolution and sample-rate combinations: 1920×1080 pixels at 120 Hz (1080p) and 3840×2160 pixels at 30 Hz (4K).
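For reference, this is roughly what running OpenPose over a clip looks like with its Python bindings (pyopenpose). Treat it as a sketch: the exact API differs between OpenPose versions, and the paths and filenames below are placeholders.

```python
import cv2
import pyopenpose as op  # built from the official OpenPose repo; API may vary per version

# Point OpenPose at its model folder (placeholder path for this sketch)
opWrapper = op.WrapperPython()
opWrapper.configure({"model_folder": "openpose/models/"})
opWrapper.start()

cap = cv2.VideoCapture("actor.mp4")  # placeholder clip
all_keypoints = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    datum = op.Datum()
    datum.cvInputData = frame
    opWrapper.emplaceAndPop(op.VectorDatum([datum]))
    # datum.poseKeypoints: (num_people, 25, 3) array of (x, y, confidence) per joint
    all_keypoints.append(datum.poseKeypoints)
cap.release()
```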

The OpenPose algorithm is applied to each individual frame of the video, i.e. there is no temporal continuity. The absolute deviation between corresponding joint positions, calculated from the two motion captures over both trials, is distributed roughly as follows: about 47% of deviations are below 20 mm, 80% are below 30 mm, and 10% are above 40 mm. The paper concludes that a system is needed to correct the errors that occur when a limb is occasionally detected incorrectly. With such a system (automatic correction of gross errors) in place, they generally achieve an accuracy of about 3 cm. That is not very impressive: if the joints are off by 2 to 3 cm roughly half the time, you have, in my opinion, a motion capture system that is not really accurate at all. Good enough for other use cases? Sure. But for generating animations that can be used in games? This library is simply not accurate enough.
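To put those numbers in context, here is a hedged sketch of how such a deviation distribution can be computed: compare corresponding joint positions from two captures and look at what fraction of the errors falls under each threshold. The function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def deviation_stats(capture_a, capture_b, thresholds_mm=(20, 30, 40)):
    """capture_a, capture_b : (num_frames, num_joints, 3) joint positions in mm,
    assumed to be time- and joint-aligned. Returns, for each threshold, the
    fraction of per-joint deviations that stays below it."""
    dev = np.linalg.norm(capture_a - capture_b, axis=-1).ravel()  # per-joint error in mm
    return {t: float(np.mean(dev < t)) for t in thresholds_mm}

# A system matching the article's numbers would report roughly
# {20: 0.47, 30: 0.80, 40: 0.90}: half the joints are 2 cm or more off.
```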

Summary of existing systems

The existing services, libraries and techniques from papers can be useful in specific cases where the actor does not need to move around the room much; things like upper-body animations can get reasonable results. More complex material, such as locomotion, parkour, or anything where the context of the room matters, is still largely out of reach at the moment. All in all, we can conclude that these types of services still have a long way to go before they can serve as a full-fledged alternative to mocap suits.

Conclusion

From the above examples (and many others) we can draw a simple conclusion: "garbage in, garbage out". If the data we feed our MOCAP system (in this case a single video recording) does not contain enough information to generate full-fledged 3D mocap, the result will never be what we expect. From these examples we can conclude that one camera simply does not provide enough data for full-fledged 3D motion capture. That raises the question: what now?

We could augment our data by also capturing depth. This may help with unstable feet, for example, but it will still struggle with occlusions and more complex movements.
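What depth buys you, in a sketch: with a per-pixel depth value and the camera intrinsics, you can lift a detected 2D joint straight into camera-space 3D instead of guessing the missing dimension. The intrinsics below are placeholder values, not from any specific camera.

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift a 2D pixel (u, v) with a measured depth (in metres) to a 3D point
    in the camera's coordinate frame, using the pinhole camera model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example with placeholder intrinsics for a 1920x1080 camera:
joint_3d = backproject(u=960, v=540, depth_m=2.5, fx=1400, fy=1400, cx=960, cy=540)
```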

The ideal solution is simply to add more cameras. The more cameras, the more information our system has to work with and the better the results will be. As an example of what such a multi-camera system can achieve, below is a video of an existing (and expensive) solution that takes multiple camera feeds as input.
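A sketch of why extra cameras help: with two calibrated cameras you can triangulate the true 3D position of each joint from its two 2D detections, which is exactly the information a single view cannot give you. The projection matrices and pixel coordinates below are placeholders; in practice each projection matrix comes from a real camera calibration (intrinsics times extrinsics).

```python
import cv2
import numpy as np

# Placeholder 3x4 projection matrices for two calibrated cameras.
# In practice, P = K @ [R | t] from your camera calibration.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera 1 at the origin
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])  # camera 2, 0.5 m to the side

# The same joint as detected by each camera, as 2xN pixel coordinates (here N = 1)
pts1 = np.array([[512.0], [384.0]])
pts2 = np.array([[498.0], [381.0]])

point_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4x1 homogeneous coordinates
joint_3d = (point_h[:3] / point_h[3]).ravel()          # back to 3D (x, y, z)
```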

Unfortunately, there are not yet many solutions that take multiple camera views as input. Academic research also still focuses mostly on single-camera recordings, which is less useful to us in practice. During the short research for this blog post we came across Captury Live, but that software costs an incredible €25,000.

If you are still looking for video MOCAP solutions, keep in mind that multiple cameras will generally give better results.

If you come across any promising papers or software, please send them our way at [email protected]; we may take a look later in the year and share our findings in a follow-up blog post!