NVIDIA Maxine: How AI Will Revolutionize Video Streaming

Covid-19 drastically changed the work landscape, and in-office meetings are now held almost exclusively online. But the transition from the office to home brought some major challenges, from screaming kids to family members wandering around in the video background. NVIDIA Maxine aims to solve some of these issues using state-of-the-art artificial intelligence.

AI-Powered Video Conferencing with NVIDIA Maxine
Inventing Virtual Meetings of Tomorrow with NVIDIA AI Research
INTERNET REVOLUTION with NVIDIA MAXINE AI Video Streaming


Virtual Backgrounds

Like most people, especially in the hi-tech industry, I moved my office full time to my basement at home when Covid-19 hit. As a product manager, my day consists almost entirely of meetings, which for the most part happen over Zoom, Microsoft Teams or Skype.

Behind me are actually mirrors, and unfortunately, when I am presenting, colleagues can see my monitor! Luckily for me, Zoom implemented virtual backgrounds. Empowered by AI, I can now hide my mirrors, and thus my screen content. Plus I look like I am in space 🙂

While virtual backgrounds are not new, they still have a very long way to go. Without a green screen, the AI still gets confused about what is the person and what is the background. I wear glasses, which often show the real background in the gap between the lenses and my head. Also, if I am holding something or lifting my hands, the effect often gets thrown off. NVIDIA Maxine claims to be ready to tackle this problem head on. Any improvement to virtual background technology is more than welcome.

Video Bandwidth Reduction

Deepfake technology, i.e. seamlessly putting one person's facial movements on another face, has lately gotten a seriously bad rap. Anyone can take a video recording and replace the facial features with those of the US president, or take any person's face and plant it on a pornographic video. But, as NVIDIA Maxine shows, it also has positive and useful applications, like cutting the bandwidth of high-quality video to nearly zero!

How is that possible? Well, the best explanation you can get is by watching the video above, but here is a simple breakdown. The receiver gets one high-quality image from the sender. From then on, the sender translates their video into facial feature locations (where the eyebrows, mouth, cheeks etc. are) and sends only those. The receiver modifies the high-res image based on the new facial feature positions, resulting in a realistic animation, without the sender ever transmitting the pixel data for each video frame.
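To get a feel for why sending feature locations is so much cheaper than sending pixels, here is a rough back-of-the-envelope sketch in Python. Every number in it is my own assumption for illustration (68 keypoints, float32 coordinates, 30 frames per second, a 720p frame), not a figure published by NVIDIA. It also compares against uncompressed frames, while real video calls use compressed video, so the real-world gap is smaller than this suggests.

```python
# Illustrative numbers only: none of these values come from NVIDIA.
NUM_KEYPOINTS = 68          # assumed number of facial keypoints (hypothetical)
BYTES_PER_COORD = 4         # one float32 per x or y coordinate
FPS = 30                    # assumed frame rate
WIDTH, HEIGHT = 1280, 720   # assumed 720p reference frame
BYTES_PER_PIXEL = 3         # uncompressed RGB


def keypoint_bytes_per_frame(num_keypoints: int = NUM_KEYPOINTS) -> int:
    """Payload for one frame when only (x, y) keypoint positions are sent."""
    return num_keypoints * 2 * BYTES_PER_COORD


def raw_bytes_per_frame(width: int = WIDTH, height: int = HEIGHT) -> int:
    """Payload for one uncompressed RGB frame of the given size."""
    return width * height * BYTES_PER_PIXEL


def kilobits_per_second(bytes_per_frame: int, fps: int = FPS) -> float:
    """Convert a per-frame payload in bytes into a bitrate in kbit/s."""
    return bytes_per_frame * 8 * fps / 1000


if __name__ == "__main__":
    kp = keypoint_bytes_per_frame()
    raw = raw_bytes_per_frame()
    print(f"keypoints: {kp} bytes/frame, {kilobits_per_second(kp):.2f} kbit/s")
    print(f"raw video: {raw} bytes/frame, {kilobits_per_second(raw):.2f} kbit/s")
```

Under these assumptions the keypoints for one frame fit in a few hundred bytes, while the raw frame takes a few megabytes. The one high-quality reference image still has to be sent once at the start of the call.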

While the internet speed where I live is fortunately good, others struggle to get past voice chat. This technology can enable them to move to video conferencing at a very low cost. But even I can benefit from such a technology. When I am abroad I generally have only 1 gigabyte of data, which video would quickly consume. With this tech I could see my girlfriend on video without killing my data plan!

Source: NVIDIA developer

Face Re-Animation

My home set-up uses a laptop and a screen. The screen is directly in front of me, while the laptop is off to the side and looks at me at a 45-degree angle. As a result, when I am in a video conference I am looking straight at the screen, but not at the camera.

To me this seemed like an inconvenience, but it never really crossed my mind how solvable it is! Using the same facial key points technology as above, you can use a GAN (generative adversarial network) to re-align the face forward.

In theory, this could be used far beyond just re-aligning a skewed camera. I could be reading off a piece of paper, or talking to someone beside me, but to the people I am video chatting with it would appear I am still looking directly at the camera. It can also be a support tool for people who have a hard time staring forward during presentations, or for people reading off the screen (in which case their eyes never look at the camera).

Source: NVIDIA developer

Conversational AI

Honestly, it's ridiculous to me that translation and caption services are not more common. My wife's grandma speaks Spanish and is deaf. When they video chat, her grandma reads her lips and adds some sign language. I don't speak Spanish, and she does not speak English. I am constantly frustrated that I cannot talk in English and have her see what I say in Spanish. I guess the technology is more complex than I think. I would even want her to sign to me and have that translated into captions on my end.

For work applications this is of course great: you could have a real-time video conversation with someone who speaks a foreign language. Most of my co-workers speak Hebrew, but sometimes I have people in meetings who speak only English. Suddenly, for 1 person out of 15, the meeting is held in English, a second language for most.

If that person could turn on captions and simply read what I say, that would be, at least to me, revolutionary. I think most people can express themselves much better in their native tongue, and this can be crucial when you are trying to explain an idea or a problem. I am guessing this feature will show up fairly soon on most platforms. It is based on NVIDIA Jarvis, a fully accelerated conversational AI framework.

Source: NVIDIA developer

Super Resolution

It's nothing new that AI can take a low-res image and scale it up without losing quality. It makes perfect sense that this feature would be integrated into video conferencing tools. Using this kind of technology you can get the same quality, but transfer a lot of the work from your internet bandwidth to your CPU or GPU. This is not unique to video calls: it can be used on any type of video where the AI is familiar with the content and can accurately predict how it should look in high res.

Furthermore, you could do device-based scaling. For example, the video would always be transmitted at one fixed size. On a mobile phone it would not be scaled up, as the screen is small. On a tablet it would get a small upscale, on a monitor a larger one, and on a big TV or projector the largest.
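The device-based scaling idea can be sketched in a few lines of Python. The scale factors and the 640x360 transmitted size below are hypothetical values I picked for illustration; only the ordering (mobile, then tablet, then monitor, then TV) comes from the idea above. The actual upscaling would be done by a super-resolution model, which this sketch does not include.

```python
# Hypothetical upscale factors per device class. Only the ordering
# (mobile < tablet < monitor < tv) reflects the idea in the post.
UPSCALE_BY_DEVICE = {
    "mobile": 1.0,   # small screen: show the stream as received
    "tablet": 1.5,   # small upscale
    "monitor": 2.0,  # larger upscale
    "tv": 4.0,       # big TV or projector: largest upscale
}


def target_resolution(device: str,
                      base_width: int = 640,
                      base_height: int = 360) -> tuple[int, int]:
    """Return the output size for a stream transmitted at a fixed base size.

    The base size (640x360 by default) is an assumed transmitted resolution.
    Raises KeyError for an unknown device class.
    """
    factor = UPSCALE_BY_DEVICE[device]
    return int(base_width * factor), int(base_height * factor)


if __name__ == "__main__":
    for device in UPSCALE_BY_DEVICE:
        print(device, target_resolution(device))
```

The point of the design is that the sender always pays for the same small stream, and each receiving device decides locally how much compute to spend on upscaling.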

Last Words

My posts are generally very short, as I believe a video speaks a thousand words a second. So if you did read this far, thank you! I hope you are as excited as I am for all the new features AI will bring to video conferencing.
