Streaming VR from the Cloud with the Rift

rfurlan
Cross Eyed!
Posts: 149
Joined: Wed Jun 20, 2012 1:49 am

Streaming VR from the Cloud with the Rift

Post by rfurlan »

I am cross-posting my latest blog post here because I would love to hear the community's take on this idea :)

A while ago I was working on a fully immersive stereoscopic “remote head”. On one side the user wears an HMD; on the other side, a servo-actuated stereoscopic camera is programmed to match the orientation of the user’s head-tracking device. Sadly, even with very fast servos it wouldn't be possible to move the camera fast enough to accurately match the user's perspective. The head-tracking lag would be quite disorienting, unacceptably so even before we factor in the network lag.

On the second prototype, I decided to use a monoscopic 360 degree camera instead. The remote head would transmit the whole image-sphere to the user’s machine and I would clip the viewport on the client-side using the tracker information – effectively eliminating head-tracking lag by doing it locally. The overall experience should be great even though the video feed from the remote head could be several hundred milliseconds behind real-time.

And here is how all of this intersects with VR: a cloud-VR server could render a 360 degree image sphere around the player, transmit the whole frame to the client which would then clip the viewport based on the orientation of the user’s head. It could even be done adaptively to save bandwidth – instead of transmitting the whole image-sphere it could send only a portion of it based on how fast the user is likely to turn his head in the next N milliseconds or at a reduced frame rate, and since the raster viewport is clipped by the client, the user would still be able to look around at 60fps. Input-to-display lag would still exist but developers could overcome at least some of it by designing around this limitation.
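
For concreteness, here is a minimal sketch of that client-side clipping step: given the latest full equirectangular frame from the server and the current head yaw/pitch, the client extracts a rectilinear viewport locally, so head tracking never waits on the network. The function name, parameters and nearest-neighbour sampling are illustrative assumptions, not code from any particular library.

```python
import numpy as np

def clip_viewport(equirect, yaw, pitch, fov_deg=90.0, out_w=640, out_h=800):
    """Sample a rectilinear viewport out of an equirectangular panorama.

    equirect: H x W x 3 image covering the full sphere; yaw/pitch in radians.
    """
    pano_h, pano_w = equirect.shape[:2]
    # Per-pixel ray directions for a pinhole viewport looking down +z.
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)
    xs = np.arange(out_w) - out_w / 2.0
    ys = np.arange(out_h) - out_h / 2.0
    x, y = np.meshgrid(xs, ys)
    z = np.full_like(x, f)
    # Rotate the rays by head pitch (about x), then yaw (about y).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    y2 = cp * y - sp * z
    z2 = sp * y + cp * z
    x3 = cy * x + sy * z2
    z3 = -sy * x + cy * z2
    # Ray direction -> longitude/latitude -> panorama pixel (nearest neighbour).
    lon = np.arctan2(x3, z3)                  # -pi .. pi
    lat = np.arctan2(y2, np.hypot(x3, z3))    # -pi/2 .. pi/2
    u = ((lon / (2 * np.pi) + 0.5) * pano_w).astype(int) % pano_w
    v = np.clip(((lat / np.pi + 0.5) * pano_h).astype(int), 0, pano_h - 1)
    return equirect[v, u]
```

The server only has to keep pushing whole panoramas; everything latency-critical happens in this one local resampling step.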

The potential end-game could be something like a cloud-powered, Oculus Rift-style HMD with no console or PC required; in other words, no-hassle VR that is just plug & immerse. The required tech is already available: both NVIDIA and AMD have announced support for cloud GPU rendering, OculusVR is finally shipping the $300 Rift development kits worldwide, and all the required client-side processing could easily be handled by a $100 Android board.

(original post here: http://bitcortex.com/2013/04/08/streami ... the-cloud/)

Even though it is way too early for something like this, I find the idea quite appealing. What do you guys think?
Rod Furlan - bitcortex.com
"The first ultra-intelligent machine is the last invention that man need ever make."
STRZ
Certif-Eyed!
Posts: 559
Joined: Mon Dec 05, 2011 3:02 am
Location: Geekenhausen

Re: Streaming VR from the Cloud with the Rift

Post by STRZ »

Networking with minimal latency will surely happen one day, maybe when quantum-entangled networking becomes popular: http://arstechnica.com/science/2012/12/ ... e-thought/
geekmaster
Petrif-Eyed
Posts: 2708
Joined: Sat Sep 01, 2012 10:47 pm

Re: Streaming VR from the Cloud with the Rift

Post by geekmaster »

rfurlan wrote:... On the second prototype, I decided to use a monoscopic 360 degree camera instead. The remote head would transmit the whole image-sphere to the user’s machine and I would clip the viewport on the client-side using the tracker information – effectively eliminating head-tracking lag by doing it locally. The overall experience should be great even though the video feed from the remote head could be several hundred milliseconds behind real-time.

And here is how all of this intersects with VR: a cloud-VR server could render a 360 degree image sphere around the player, transmit the whole frame to the client which would then clip the viewport based on the orientation of the user’s head. It could even be done adaptively to save bandwidth – instead of transmitting the whole image-sphere it could send only a portion of it based on how fast the user is likely to turn his head in the next N milliseconds or at a reduced frame rate, and since the raster viewport is clipped by the client, the user would still be able to look around at 60fps. Input-to-display lag would still exist but developers could overcome at least some of it by designing around this limitation. ...
You are describing the same thing I did in my "PTZ Tweening" thread, including head-tracked, locally de-warped 360-degree frame content (perhaps from a "huge FoV" camera) delivered over a potentially slow network connection. The difference is that you propose using a 360-degree mirror ball (or mirror dome) instead of the 180-degree fisheye lens that I suggested.

I have been experimenting with simple (Raspberry Pi compatible) code to convert 360-degree spherical panoramas to Rift pre-warp, with head-tracked panning around in the current "extended FoV" image (while waiting for the next low-framerate image). That thread contains an ongoing, active discussion of content applicable to your new thread (and other related material).
At http://www.mtbs3d.com/phpBB/viewtopic.php?f=138&t=16543, geekmaster wrote:"PTZ Tweening" for low-power low-latency head-tracking

SYNOPSIS: "PTZ Tweening" as discussed in this thread is a variation of "digital image stabilization" that is used here to stabilize the most recent unwarped framebuffer image relative to the tracked head position in the virtual world, in the absence of newly rendered video frames. It is essential to maintain high-speed low-latency head tracking to anchor the head position in the virtual world, even in underpowered or harsh computing environments. The method proposed here meets these requirements to a sufficient degree, even with a low-speed high-latency rendered video source, by uncoupling fast head tracking from potentially slow frame rendering. The goal here is to provide low-latency head tracking in a low-power environment, or to serve as a low-overhead backup method for head-tracked approximate lost frame synthesis in a desktop or "backtop" VR environment.
...
Now imagine yourself in the Oculus Rift HMD, looking at an unwarped portion of a spherical projection (perhaps from a camera with a fisheye lens). When you move your head, a nearby portion of the spherical projection is unwarped so that your view moves slightly to track your head rotation. This would be immersive. Experimental evidence shows that you can substitute a small amount of rotation for sideways movement, and a small amount of zoom for forward or backward movement. You can even tilt the image to match head tilt toward the shoulders, and moving toward or away from the horizon can be simulated by adjusting the zoom.

The process described above would allow creating tweened (linearly interpolated) frames that are perceived as BELONGING in between sequentially rendered frames, when the content difference between the frames is low. In the case of head tracking, we can use simple displacement mapping to simulate small head motions with very low latency, while waiting for the next "real" frame from the rendering pipeline. This allows us to render the world at a slow pace, such as when using low-power graphics hardware such as the Raspberry Pi.

We can even do this in a client/server arrangement where the PTZ Tweening is performed by the display device (e.g. a Raspberry Pi) while the heavy rendering is done by a remote PC over a wireless connection.
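
A minimal sketch of that tweening step, treating a small head delta as a pan/tilt/roll/zoom of the most recent frame; the pixels-per-radian factor and the OpenCV affine warp are stand-ins assumed here for illustration, not the code from that thread:

```python
import numpy as np
import cv2

def ptz_tween(last_frame, d_yaw, d_pitch, d_roll, zoom=1.0, px_per_rad=600.0):
    """Approximate a small head motion as a 2D shift/rotate/scale of last_frame."""
    h, w = last_frame.shape[:2]
    dx = -d_yaw * px_per_rad      # yaw right  -> scene slides left
    dy = d_pitch * px_per_rad     # pitch up   -> scene slides down
    # Roll and zoom about the frame centre, then add the pan offset.
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), np.degrees(d_roll), zoom)
    m[0, 2] += dx
    m[1, 2] += dy
    return cv2.warpAffine(last_frame, m, (w, h))
```

The display device can run something like this every refresh against the newest frame it has, so the tween stays cheap enough for low-power hardware while the real rendering arrives at its own pace.
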
It seems that when an idea is ready to be "invented", it will be, by multiple people. Just look at the parallel invention of the telephone, with patents applied for by different people only two hours apart...
:D
nateight
Sharp Eyed Eagle!
Posts: 404
Joined: Wed Feb 27, 2013 10:33 pm
Location: Youngstown, OH

Re: Streaming VR from the Cloud with the Rift

Post by nateight »

Welcome back from your trip, rfurlan! After reading bits and pieces of the DIY Rift thread (one of the most fascinating things on the Internet, IMO) I had wondered about your lengthy absence. I'm excited to see what craze you kick off next! :D

Sadly, I don't think it's going to be head-tracked telepresence, at least not in anything like the near term. I quite like the underlying concept, and geekmaster's ongoing experiments should prove most of the technology is workable, but I think the "Cloud" portion of this proposal is where it currently runs aground. Consider this now-infamous tweet by John Carmack:
@ID_AA_Carmack wrote:I can send an IP packet to Europe faster than I can send a pixel to the screen. How f’d up is that?
A discussion here points out that while this sentiment can be technically correct, it assumes a one-way UDP packet traveling over the Platonic ideal of a "network". Over a single unbroken strand of glass thousands of miles long, existing solely to carry that one packet, it could probably get from the US to Europe in something like ~60ms, but that completely ignores the troublesome realities of doing such a thing in the real world. What you're suggesting may be practical on, say, the Internet2 network across a few hundred miles, but I've always been of the opinion that anything you can only demonstrate under academic circumstances isn't likely to set the world on fire. If we assume Google Fiber is the very best residential ISP in the US, take a look at some key ping times from this real-world test. Even under ideal circumstances, an ICMP echo round-trip like these represents the lowest possible time to send one unacknowledged UDP packet from the head tracker to the camera-bot and get the first packet of position-updated camera data in reply.

Something like ~10ms to send one packet of data from Kansas City to Dallas is great news! But note that's all strictly on Google's own network, and also (if my math is correct) you need a bare minimum of 42 IPv4 UDP packets to construct a single frame of even 720p video. Video games can get away with relatively large pingtimes and slow pipes because control, position, and server information packets contain only vanishingly small snippets of data which a client-side engine processes; an array of cameras will have no such luxury. Perhaps a forward-thinking programmer could re-engineer current display technology to fill frame buffers adaptively, to catch, decode, and reconstruct each frame 65,507 bytes at a time; I wouldn't know, I'm already out of my depth here. What is certain is that for most traffic outside Google's own network, the transmission time for a single round-trip packet falls somewhere in the 40-200ms range (this is the "best" NA ISP, remember), and if we proceed from the supposition that sub-20ms is what it takes for mediated reality to feel "natural" and supra-50ms makes it feel "barfy", we're very clearly in trouble until everything is on Google's own fiber-based network.
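
To make that packet figure concrete, here is the back-of-the-envelope arithmetic, assuming a completely uncompressed 24-bit 720p frame split into maximum-size IPv4 UDP datagrams; real streams would be compressed, so treat this as a worst case rather than a protocol design:

```python
import math

frame_bytes = 1280 * 720 * 3              # uncompressed 24-bit RGB 720p frame
max_udp_payload = 65507                   # IPv4 UDP payload limit, in bytes
packets = math.ceil(frame_bytes / max_udp_payload)
print(frame_bytes, packets)               # 2764800 bytes -> 43 datagrams
                                          # (42 full-size ones plus a partial)
```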

Of course, this last part is essentially the nut that Google Fiber has set out to crack (or so I and my single-provider market dearly hope). Perhaps in 10 years we'll all forget what online privacy even was and Google will be the primary ISP in North America, telepresence packets will be afforded the highest transmission priority and to hell with that network neutrality stuff. I wouldn't be shocked by this outcome; indeed, at closer to 20 years out, I'd start to be amazed this exact chain of events didn't occur. I think this is what it's going to take for general use telepresence to occur, though; the even worse news is all this stuff is also a major sticking point in the creation of the VR MMO games I most desperately want to play. :shock: :cry:

Please note, I am a "networking guy" but I don't have much in the way of practical experience beyond pulling the occasional cable, and my understanding of a lot of the hardware challenges inherent in VR is shaky at best. If someone out there can provide a thorough refutation of my thesis here, I'd actually be thrilled. :D

EDIT: I'll continue to give this some thought, because it's clear I haven't fully grasped the possibilities yet. For a user-movable camera-bot the problems above would appear to be a limiting factor, but what about starting with a fixed-position 360º camera array, or even one attached to a vehicle controlled only at the remote end? Now you can essentially turn a real-time scene into a video game representation, and all that network latency boils down to something more closely resembling a buffer. If the origin point for the head-tracked camera is stationary, it may not matter that the actual camera data takes several seconds to arrive - whatever data the user's PC currently has is what the "engine" portrays, and while the user couldn't control the motion of the origin point without jarring lag, everything would appear real-time despite being arbitrarily delayed by networking realities. This may still run headlong into the problem of what happens to the vestibular system when avatar motion occurs without being initiated by the user's own brain, but that's going to be a problem for any kind of VR/AR until it's solved by something like galvanic stimulation.
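
A rough sketch of that arrangement, with the network feed reduced to a "newest frame wins" slot that the locally head-tracked render loop reads from; the three callables are hypothetical placeholders for whatever a real client would plug in:

```python
import threading

def start_streaming_client(recv_panorama, read_head_pose, draw_viewport):
    """Couple a slow, possibly seconds-old panorama stream to a fast local render loop."""
    latest = {"frame": None}
    lock = threading.Lock()

    def receiver():
        while True:
            frame = recv_panorama()        # blocks on the network, however laggy
            with lock:
                latest["frame"] = frame    # keep only the newest frame

    threading.Thread(target=receiver, daemon=True).start()

    while True:                            # render loop: never waits on the network
        yaw, pitch = read_head_pose()      # fresh, local, low-latency head pose
        with lock:
            frame = latest["frame"]
        if frame is not None:
            draw_viewport(frame, yaw, pitch)
```
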
Last edited by nateight on Mon Apr 08, 2013 4:53 pm, edited 1 time in total.
Shameless plug of the day - Read my witty comments on Reddit, in which I argue with the ignorant, over things that don't matter, for reasons I never fully understood!
phoenix96
One Eyed Hopeful
Posts: 17
Joined: Tue Mar 19, 2013 1:32 pm
Location: Santa Barbara, CA

Re: Streaming VR from the Cloud with the Rift

Post by phoenix96 »

nateight wrote:... if we proceed from the supposition that sub-20ms is what it takes for mediated reality to feel "natural" and supra-50ms makes it feel "barfy", we're very clearly in trouble until everything is on Google's own fiber-based network. ...
But if a 360-degree image is streamed whole, the head tracking can be done on the local client and thus without significant latency. There'd obviously be a lag between when something happens in the real world and when you'd see it in the image, but I think it could work.
nateight
Sharp Eyed Eagle!
Posts: 404
Joined: Wed Feb 27, 2013 10:33 pm
Location: Youngstown, OH

Re: Streaming VR from the Cloud with the Rift

Post by nateight »

phoenix96 wrote:But if a 360-degree image is streamed whole, the head tracking can be done on the local client and thus without significant latency. There'd obviously be a lag between when something happens in the real world and when you'd see it in the image, but I think it could work.
Heh. While you were posting this, I was adding in this edit:
EDIT: ... what about starting with a fixed-position 360º camera array, or even one attached to a vehicle controlled only at the remote end? Now you can essentially turn a real-time scene into a video game representation, and all that network latency boils down to something more closely resembling a buffer. ...
Like geekmaster said, great minds think alike? :lol:
Shameless plug of the day - Read my witty comments on Reddit, in which I argue with the ignorant, over things that don't matter, for reasons I never fully understood!
nateight
Sharp Eyed Eagle!
Posts: 404
Joined: Wed Feb 27, 2013 10:33 pm
Location: Youngstown, OH

Re: Streaming VR from the Cloud with the Rift

Post by nateight »

For reference, check out this 360º "mouse-tracking" video with something like six whole months of latency. If your buffer is full of data, it doesn't matter how far behind reality the footage is - the effect is still immersive. That said, I have a hard time seeing anyone but the most iron-stomached among us watching such a video in a Rift, and even if letting someone else control your body's orientation isn't sickening, the network latency involved in moving the camera itself remains problematic.

Still, this is eventually going to be a "thing", I'm certain of it.
Shameless plug of the day - Read my witty comments on Reddit, in which I argue with the ignorant, over things that don't matter, for reasons I never fully understood!
geekmaster
Petrif-Eyed
Posts: 2708
Joined: Sat Sep 01, 2012 10:47 pm

Re: Streaming VR from the Cloud with the Rift

Post by geekmaster »

nateight wrote:For reference, check out this 360º "mouse-tracking" video with something like six whole months of latency.
I had seen that before, and it wasn't a problem then. Now I get a little queasy just watching it in a window in my browser. I was looking at fish in a large aquarium a couple of days ago, and the prismatic distortion as I moved my head side to side while standing close to the aquarium glass caused a little motion sickness. Maybe I need to cut back on the length of my Rifting sessions?

But yeah, with head tracking you would be a passenger above the rider's head in that video...

P.S. I figured out a way (during a dream) to film 360-degree videos like this in stereoscopic 3D, using two side-by-side "mirror ball" cameras to view forward and back (swapping cameras when looking back). For sideways views, where you lose camera parallax, use a variation of the "Pulfrich Effect" to create 3D from a pair of current and delayed images from a single camera. The Pulfrich Effect only works for images moving past you, like filming out a side window in a moving vehicle, which is why we use a pair of cameras for the direction of travel. Is that a new idea, or are people already doing that?
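
If it helps, here is a sketch of how that camera selection might look as a function of view yaw; the 60-degree threshold, the two-frame Pulfrich delay, and which eye gets the delayed image are all guesses, purely for illustration:

```python
import math

def eye_sources(yaw, delay_frames=2):
    """Return ((camera, frame_delay) for the left eye, (camera, frame_delay) for the right).

    yaw is radians from straight ahead; cameras "A" and "B" sit side by side,
    with A on the viewer's left when facing forward.
    """
    if abs(math.cos(yaw)) > math.cos(math.radians(60)):   # roughly forward or backward
        if math.cos(yaw) > 0:
            return ("A", 0), ("B", 0)      # forward: ordinary stereo pair
        return ("B", 0), ("A", 0)          # backward: the cameras swap sides
    # Sideways: the baseline lies along the view axis, so real parallax is gone.
    # Fake depth Pulfrich-style with a current + delayed frame from one camera.
    if math.sin(yaw) > 0:                  # looking toward one side
        return ("A", 0), ("A", delay_frames)
    return ("B", delay_frames), ("B", 0)   # looking toward the other side
```
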
Drewbdoo
Binocular Vision CONFIRMED!
Posts: 231
Joined: Tue Feb 12, 2013 11:02 pm

Re: Streaming VR from the Cloud with the Rift

Post by Drewbdoo »

Nate, my only problem with your posts is that there's always a self-deprecating comment like "I'm just a layman" or "I'm out of my depth here, so I'm just guessing". Your comments are always the most well-cited and, by that virtue, the most well thought out, so give yourself some credit. In some way, we are all laymen. You take the time to google enough to inform yourself before joining in, and I always find you add to the conversation.

Anyway, there wouldn't have to be a round-trip loop (unless you were sending some kind of control information), because it would just be a video stream. The image feed, if watched on a monitor in 2D, would appear to be a garbled mess, but imagine taking one frame of it and digitally warping it around a sphere. Now imagine a virtual camera placed inside that sphere feeding the Rift, with the view swiveling around from that center point. There's no reason it should take more bandwidth than streaming any 720p content, and I'm doing that from YouTube at the moment with uTorrent running. ;)

Well, no, that's not true, because I think you'd want the image to be of considerably higher quality, since your FoV would be taking up only a fraction of the image... hmm.
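
Rough sizing for that point, assuming a dev-kit-ish 640 pixels per eye across a 90-degree viewport (both numbers are guesses):

```python
viewport_px = 640                  # horizontal pixels wanted inside the viewport
viewport_fov = 90                  # degrees of the sphere the viewport spans

pano_w = viewport_px * 360 // viewport_fov   # 2560 px around the equator
pano_h = pano_w // 2                         # 1280 px pole to pole (equirectangular)
print(pano_w, pano_h)              # -> 2560 1280, noticeably more than 720p
```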

Anyway, the only major hurdle I see is that it wouldn't be stereoscopic without another lens capturing a second eye viewpoint. Other than that, I think this setup would work brilliantly, and I'd love to have a quadrocopter with this attached to it and learn to fly a drone in first-person perspective.
:woot

Edit: Also, apparently I replied to an older version of the thread I had open from the other day and didn't see Nate's edits. It appears your eyes are open to the awesomeness of this idea :D
rfurlan
Cross Eyed!
Posts: 149
Joined: Wed Jun 20, 2012 1:49 am

Re: Streaming VR from the Cloud with the Rift

Post by rfurlan »

That is great to hear, geekmaster; please keep us posted on your progress :)

I think this technique will be very useful for telepresence too; the prototype I was working on was for a remotely operated robot. Imagine how cool it would be to use an Oculus Rift to visit, for example, the International Space Station!

Using a wide-FoV camera instead of an actuated one also has the advantage that we can multiplex it: hundreds of people could experience their own independent viewpoints simultaneously from a single remote camera.
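
A sketch of that fan-out, just to show how little the server has to care about individual viewers: it pushes the same encoded panorama to every subscriber, and each client clips its own view locally. The chunk size, sequence header, and port number are illustrative assumptions.

```python
import socket

CHUNK = 60000                      # stay under the IPv4 UDP payload limit

def broadcast_panorama(frame_bytes, subscribers, port=9000):
    """Send one encoded panorama frame to every subscriber address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    chunks = [frame_bytes[i:i + CHUNK] for i in range(0, len(frame_bytes), CHUNK)]
    for addr in subscribers:                 # the same bytes go to every viewer
        for seq, chunk in enumerate(chunks):
            header = seq.to_bytes(4, "big")  # minimal sequencing for reassembly
            sock.sendto(header + chunk, (addr, port))
```
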
geekmaster wrote:... I have been experimenting with simple (Raspberry Pi compatible) code to convert 360-degree spherical panoramas to Rift pre-warp, with head-tracked panning around in the current "extended FoV" image (while waiting for the next low-framerate image). ...
Rod Furlan - bitcortex.com
"The first ultra-intelligent machine is the last invention that man need ever make."
Evenios
Binocular Vision CONFIRMED!
Posts: 315
Joined: Sat Mar 16, 2013 11:02 pm

Re: Streaming VR from the Cloud with the Rift

Post by Evenios »

Maybe we could live out someone else's life? Like a VR Truman Show where we can plug in and see what someone experiences 24/7, haha ;-)