Fast sparse stereo-matching [Computer Vision]
- FingerFlinger
- Sharp Eyed Eagle!
- Posts: 429
- Joined: Tue Feb 21, 2012 11:57 pm
- Location: Irvine, CA
Fast sparse stereo-matching [Computer Vision]
As part of a larger project, I made a little bare-bones library for stereo matching of sparse features. It's very basic, and completely tailored to my own needs, but I thought it was nifty, so I spent a little time this evening cleaning it up and threw it on GitHub. Binaries here
Based on various papers and snippets, it keeps up with or exceeds the speed and performance of similar sparse matchers, despite running on 7-year-old hardware. For comparison, LibViso2 claims that it can match 1000 features in 35ms. The image above was matched in 7.7ms, yielding 253 matches.
Put in equivalent terms:
LibViso2: 28571 per second
My thing: 32857 per second
But that's tuned for rock-solid matches. If you relax the matching criteria slightly, you can still get over 90% matching accuracy with 469 matches in 5.6ms, i.e. ~83750/s.
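Working those throughput figures out explicitly (a quick sanity check in Python; the counts and timings are the ones quoted above):

```python
# Convert a match count plus elapsed milliseconds into matches per second,
# using the figures quoted in the post.
def matches_per_second(n_matches, time_ms):
    return n_matches / (time_ms / 1000.0)

print(round(matches_per_second(1000, 35)))   # LibViso2's claim -> 28571
print(round(matches_per_second(253, 7.7)))   # strict settings  -> 32857
print(round(matches_per_second(469, 5.6)))   # relaxed settings -> 83750
```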
There are a few major optimizations left, but I am saving them for later, since it suits my needs as-is. One funny thing is that the algorithm is almost completely memory-bound, so my netbook (w/ DDR3 memory) can actually run this about 20% faster than my "gaming PC" from 2008.
Last edited by FingerFlinger on Fri Jul 05, 2013 7:31 pm, edited 2 times in total.
- cybereality
- 3D Angel Eyes (Moderator)
- Posts: 11407
- Joined: Sat Apr 12, 2008 8:18 pm
-
- Golden Eyed Wiseman! (or woman!)
- Posts: 1498
- Joined: Fri Jul 08, 2011 11:47 pm
Re: Fast sparse stereo-matching [Computer Vision]
Very cool FingerFlinger.
-
- Certif-Eyed!
- Posts: 661
- Joined: Sun Mar 25, 2012 12:33 pm
Re: Fast sparse stereo-matching [Computer Vision]
What would a depth map composed of voronoi regions look like using these points?
Each region would be some grayscale color representing that point, and it might look interesting overlaid on the original image.
More points means better depth map!
By the way, is this used for your optical flow project? With my limited knowledge of optical flow algorithms, I imagine knowing depth would help immensely in figuring out parallax and improving the accuracy of the track.
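The Voronoi-region idea is easy to prototype: assign every pixel the depth of its nearest matched feature. A minimal NumPy sketch (the function name, brute-force search, and the point/depth inputs are my own illustration, not part of the library):

```python
import numpy as np

def voronoi_depth_map(points, depths, height, width):
    """Assign every pixel the depth of its nearest matched feature,
    giving a Voronoi-region depth map.
    points: (N, 2) array of (row, col) feature locations.
    depths: (N,) array of per-feature depth values (e.g. grayscale 0..255)."""
    rows, cols = np.mgrid[0:height, 0:width]
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    # Brute-force nearest neighbour: O(H*W*N), fine for a few hundred points.
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return depths[nearest].reshape(height, width)
```

For full-resolution images a KD-tree (e.g. scipy.spatial.cKDTree) would replace the brute-force distance matrix, but the output is the same: each Voronoi cell filled with its feature's depth, ready to overlay on the original image.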
- FingerFlinger
- Sharp Eyed Eagle!
- Posts: 429
- Joined: Tue Feb 21, 2012 11:57 pm
- Location: Irvine, CA
Re: Fast sparse stereo-matching [Computer Vision]
Actually, when I get a little time, I am going to add a dense option to the library. I was perhaps a little preemptive when I named the repo SparseStereo.
There is no reason that you couldn't attempt to find a match for every pixel in the image, but the reason that I haven't done so is because I am targeting a real-time application. In my demo, the default settings generate about 10000 keypoints between the 2 images. That's simply many fewer points than a dense map would need to match (375*450 = 168750px), and therefore a lot faster to compute.
Yes, it is used for my optical flow/stereo visual odometry project. It doesn't necessarily improve the quality of the actual optical flow, but it helps you figure out how much weight to give each flow vector; i.e., vectors at greater depth have a smaller magnitude than nearer vectors, even though they represent the same amount of camera motion in real-world terms.
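That depth-weighting intuition can be made concrete: under a pinhole model with pure camera translation, a point at depth Z produces an image velocity proportional to 1/Z, so multiplying each flow vector by its depth puts near and far vectors on a common scale. A hedged sketch (the function name and the pure-translation assumption are mine, not from the library):

```python
import numpy as np

# Pinhole intuition: for camera translation t, a point at depth Z moves
# roughly f * t / Z pixels in the image. Scaling each flow vector by its
# depth therefore yields a quantity comparable across near and far points.
def depth_normalized_flow(flow_vectors, depths):
    """flow_vectors: (N, 2) pixel displacements; depths: (N,) depths."""
    return flow_vectors * depths[:, None]

# A far point (depth 4) moving 1px and a near point (depth 1) moving 4px
# imply the same camera translation, and normalize to the same vector.
```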
- Fredz
- Petrif-Eyed
- Posts: 2255
- Joined: Sat Jan 09, 2010 2:06 pm
- Location: Perpignan, France
- Contact:
Re: Fast sparse stereo-matching [Computer Vision]
FingerFlinger wrote:Actually, when I get a little time, I am going to add a dense option to the library. I was perhaps a little preemptive when I named the repo SparseStereo.
You may already know about it, but you can find a lot of good implementations of dense stereo correspondence here: http://vision.middlebury.edu/stereo/eval/
FingerFlinger wrote:There is no reason that you couldn't attempt to find a match for every pixel in the image, but the reason that I haven't done so is because I am targeting a real-time application.
The second entry in the evaluation (ADCensus) is near real-time; it takes 0.1s on a GeForce GTX 480 with CUDA. I guess something simpler could be implemented a bit faster.
- brantlew
- Petrif-Eyed
- Posts: 2221
- Joined: Sat Sep 17, 2011 9:23 pm
- Location: Menlo Park, CA
Re: Fast sparse stereo-matching [Computer Vision]
Impressive. Keep it comin...
- FingerFlinger
- Sharp Eyed Eagle!
- Posts: 429
- Joined: Tue Feb 21, 2012 11:57 pm
- Location: Irvine, CA
Re: Fast sparse stereo-matching [Computer Vision]
Yep, I am aware of the Middlebury evaluations; the whole website is great!
I'm doing sparse because there is a lot of stuff to do with each point of interest. I also need to do optical flow and outlier rejection in addition to stereo correspondence. All of those steps add up, and I don't think it is practical to go completely dense for what I am trying to do.
Given that, I anticipate that my current performance is already good enough for my needs, and optimization will come after I have gotten the rest of my algorithm implemented.
I'm definitely interested in working with CUDA, since SSE has been pretty fruitful for me.
And thanks for pointing me specifically to that second link! I haven't seen their paper before, but their solution utilizes the Census transform, which is also what my library uses (albeit in a radically different way), so I might be able to gain some insight from their CUDA implementation.
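For readers unfamiliar with it, the Census transform replaces each pixel with a bit string recording which neighbours are darker than the centre, and the matching cost between two pixels is the Hamming distance between their codes. A simplified 3x3 textbook version in NumPy (an illustration only, not FingerFlinger's SSE implementation):

```python
import numpy as np

def census_3x3(img):
    """3x3 Census transform: each pixel becomes an 8-bit code encoding
    whether each neighbour is darker than the centre pixel.
    Border pixels are left as 0 for simplicity."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    centre = img[1:h-1, 1:w-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for bit, (dy, dx) in enumerate(offsets):
        shifted = img[1+dy:h-1+dy, 1+dx:w-1+dx]
        out[1:h-1, 1:w-1] |= ((shifted < centre).astype(np.uint8) << bit)
    return out

def hamming(a, b):
    """Matching cost between two census codes: number of differing bits."""
    return bin(int(a) ^ int(b)).count("1")
```

Because the transform only compares relative intensities, it is robust to the exposure and gain differences between the two cameras, which is part of why Census-based costs show up in both sparse matchers and dense methods like ADCensus.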
-
- Certif-Eyed!
- Posts: 661
- Joined: Sun Mar 25, 2012 12:33 pm
Re: Fast sparse stereo-matching [Computer Vision]
I was trawling some ancient leap hacking threads, and I found this image:
Supposedly it's raw data from the Leap's sensors (640x480 each). Each Leap camera has ~150 degrees FoV (with substantial distortion), but I'm curious how your stereo algorithm works on it.
The CTO of leap mentioned: "the lens distortions are nonlinear, non-monotonic, and fish-eye so they are very difficult to calibrate with existing vision packages. This is why we developed our own calibration method..." so it might be a lost cause in terms of creating dense stereo maps...
PS: If it does work, some guy made a raw data plug-in for Linux with some super sketchy code:
https://github.com/elinalijouvni/OpenLeap
- FingerFlinger
- Sharp Eyed Eagle!
- Posts: 429
- Joined: Tue Feb 21, 2012 11:57 pm
- Location: Irvine, CA
Re: Fast sparse stereo-matching [Computer Vision]
Not very well, I'd say!
EDIT: You can always do a manual calibration. OpenCV is great for that. I've got a half-finished tool that would automate it, actually. When life settles down a little bit, it's on the top of my list to finish...
- android78
- Certif-Eyable!
- Posts: 990
- Joined: Sat Dec 22, 2007 3:38 am
Re: Fast sparse stereo-matching [Computer Vision]
It looks like they probably use the brightness as a rough depth map. Using that, they can do very localized matching between the two images to get the final position.
Basically, for a pixel, you could estimate the distance from the brightness via the inverse-square falloff of the illumination (times a scaling factor). This can be validated against the second image. Then you can do matching within a small region to get an even more accurate position.
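That inverse-square idea can be sketched directly: if the scene is lit by the device's own IR emitters, returned brightness B falls off roughly as 1/d², so d ≈ k/√B for some unknown scale k. A hypothetical NumPy version (k and the illumination model are assumptions, not anything confirmed about the Leap):

```python
import numpy as np

def brightness_to_depth(brightness, k=1.0):
    """Rough per-pixel depth prior from an actively illuminated IR image,
    assuming brightness ~ 1/d**2, hence d ~ k / sqrt(brightness).
    k is an unknown scale factor that a calibration would have to fix."""
    b = np.maximum(brightness.astype(float), 1e-6)  # guard against zeros
    return k / np.sqrt(b)

# Quadrupling the brightness halves the estimated distance, as the
# inverse-square model predicts.
```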
-
- Golden Eyed Wiseman! (or woman!)
- Posts: 1329
- Joined: Fri Jun 08, 2012 8:18 pm
Re: Fast sparse stereo-matching [Computer Vision]
android78 wrote:It looks like they probably use the brightness as a rough depth map. Using that, they can do very localized matching between the two images to get the final position.
Basically, for a pixel, you could estimate the distance from the brightness via the inverse-square falloff of the illumination (times a scaling factor). This can be validated against the second image. Then you can do matching within a small region to get an even more accurate position.
I had figured it was going to be something along those lines, which makes it somewhat similar to the original Kinect (more densely illuminated objects are closer). You can probably further deduce depth information by simple comparison of overlapped stereo pairs; the fact that it's monochromatic probably makes it much easier.
-
- Certif-Eyed!
- Posts: 661
- Joined: Sun Mar 25, 2012 12:33 pm