Tuesday, April 24, 2012

I have been debugging a weird bug for a week. It causes random segmentation faults for no obvious reason. Since I decided to reinvent the wheel, starting from the very beginning and using OpenCV and PCL to build a casual version of the project, every tiny problem makes life hard, including the incompatibilities between pcl-1.1 and pcl-1.5 and between opencv2.3 and opencv2.0. The result so far is far from satisfactory.

Wednesday, April 18, 2012

OpenCV

There are some problems integrating CUDA SURF with ROS, so I am now shifting to OpenCV GPU support. The libopencv2.3 package was not compiled with CUDA support, so I need to recompile OpenCV 2.3 with the CUDA flag.

I am now mainly testing offline, as ROS provides rosbag to log and play back data for real-time processing and debugging. CUDA SURF needs roughly 0.02s for SURF detection and matching, but 0.08s for image loading and writing on 640x480 RGB8 images.


Continuous SURF detection and matching 

Sunday, April 1, 2012

Midpoint Check

To improve the performance of the real-time implementation of RGB-D 3D environment reconstruction, the following algorithms could be implemented with CUDA.
1. SURF (Speeded Up Robust Features) detector/descriptor/matching
2. RANSAC
3. ICP (Iterative Closest Point)

By the midpoint I have basically finished testing the GPU-based SURF algorithm and modifying its interface for integration into ROS. It relies on the OpenCV, OpenSURF, CUDPP and CUDA SURF libraries.

The next step is to combine RANSAC pose estimation and ICP into sequential kernels with minimal memory transfer. Real-time rendering with OpenGL in the rviz node should be considered as well.

The midpoint check presentation can be downloaded here.

Wednesday, March 28, 2012

Preliminary Comparison of OpenSURF and CUDA SURF

OpenSURF[1] is an implementation of the SURF feature detector/descriptor/matcher in C++/C#. CUDA SURF[2] is an implementation of OpenSURF using the CUDA SDK and CUDPP. Both use OpenCV for basic image operations. CUDA SURF shares exactly the same function interface as OpenSURF, so they are a reasonable pair for a performance comparison.


Here's a brief test of the SURF algorithm on CPU vs GPU on the same computer (Intel Xeon 3.60GHz / 4GB / Nvidia Quadro FX 5800 / Ubuntu 11.04 32-bit). The input images are shown.
Test Images from OpenSURF[1]


The preliminary test shows that both algorithms achieve good and similar results, but CPU-based OpenSURF (0.65s) is roughly 3x faster than GPU-based CUDA SURF (1.72s). I was quite surprised at first, so I added timing probes to find the difference between the two implementations. It turns out that CUDA SURF spends almost all of its time on initialization, allocating memory (1.66s), and the rest of the pipeline is far faster than OpenSURF. It is potentially feasible for real-time processing, since the initialization is needed only once. More details will be tested and discussed later, and I will try to optimize CUDA SURF on this specific computer.


Code in the Time(2) section:


      // Allocate pitched device memory for the image buffers
      int img_width = src->width;
      int img_height = src->height;
      size_t rgb_img_pitch, gray_img_pitch, int_img_pitch, int_img_tr_pitch;
      // RGB input (one unsigned int per pixel), grayscale image and integral image: width x height
      CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_rgb_img, &rgb_img_pitch, img_width * sizeof(unsigned int), img_height) );
      CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_gray_img, &gray_img_pitch, img_width * sizeof(float), img_height) );
      CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_int_img, &int_img_pitch, img_width * sizeof(float), img_height) );
      // Transposed buffers: height x width; both calls use the same dimensions and write to the same pitch variable
      CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_int_img_tr, &int_img_tr_pitch, img_height * sizeof(float), img_width) );
      CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_int_img_tr2, &int_img_tr_pitch, img_height * sizeof(float), img_width) );


CPU-based OpenSURF(0.65s)
Matches: 76
Time(load):0.03000
Time(descriptor):0.56000
        Time(Integral):0.00000
        Time(FastHessian):0.00000
        Time(getIpoints):0.09000
        Time(descriptor):0.33000
        Time(cvReleaseImage):0.00000
        --------------------------------------
        Time(Integral):0.00000
        Time(FastHessian):0.00000
        Time(getIpoints):0.03000
        Time(descriptor):0.11000
        Time(cvReleaseImage):0.00000
Time(match):0.02000
Time(plot):0.00000
Time(save):0.04000


GPU-based CUDA SURF(1.72s)
Matches: 66
Time(load):0.02000
Time(descriptor):1.69000
        Time(Integral):1.68000
                Time(1):0.0000000000
                Time(2):1.6800000000
                Time(3):0.0000000000
                Time(4):0.0000000000
                Time(5):0.0000000000
                Time(6):0.0000000000
                Time(7):0.0000000000
                Time(8):0.0000000000
        Time(FastHessian):0.00000
        Time(getIpoints):0.00000
        Time(descriptor):0.00000
        Time(freeCudaImage):0.00000
        --------------------------------------
        Time(Integral):0.00000
                Time(1):0.0000000000
                Time(2):0.0000000000
                Time(3):0.0000000000
                Time(4):0.0000000000
                Time(5):0.0000000000
                Time(6):0.0000000000
                Time(7):0.0000000000
                Time(8):0.0000000000
        Time(FastHessian):0.00000
        Time(getIpoints):0.01000
        Time(descriptor):0.00000
        Time(freeCudaImage):0.00000
Time(match):0.01000
Time(plot):0.00000
Time(save):0.03000

             


CPU-based OpenSURF(0.65s)




GPU-based CUDA SURF(1.72s)

BTW, there may be a better way to do the timing that would increase the accuracy.[3]
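One possible way to get finer-grained probes than the 0.01s steps visible in the logs above is a monotonic high-resolution clock. The following is only a sketch, assuming a C++11 compiler; time_seconds and mean_time_seconds are hypothetical helper names, not part of OpenSURF or CUDA SURF.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs fn() once and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename Fn>
double time_seconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: averages the elapsed time over several runs,
// which smooths out noise when a single run is very short.
template <typename Fn>
double mean_time_seconds(Fn&& fn, int repetitions) {
    assert(repetitions > 0);
    double total = 0.0;
    for (int i = 0; i < repetitions; ++i) {
        total += time_seconds(fn);
    }
    return total / repetitions;
}
```

A probe would then look like double t = time_seconds([&] { /* stage to measure */ });. For GPU stages the device should be synchronized inside the measured function, because kernel launches return before the kernel has finished. Timing the first call separately from the later ones would also keep the one-time initialization cost out of the average.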

             


Reference
[1] OpenSURF, http://www.chrisevansdev.com/computer-vision-opensurf.html
[2] CUDA SURF, http://www.d2.mpi-inf.mpg.de/surf
[3] Measuring Computing Times and Operation Counts of Generic Algorithms, http://www.cs.rpi.edu/~musser/gp/timing.html

CUDPP

CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.[1]


As SURF uses integral images for fast convolution, it is essential to calculate the summed-area table with CUDA. CUDPP is one option for this and is a dependency of CUDA SURF.[2]
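To make the summed-area table concrete, here is a minimal CPU reference sketch. It is not the CUDA SURF or CUDPP implementation; the function names integral_image and box_sum and the row-major std::vector<float> layout are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a summed-area table: sat[y * width + x] holds the sum of all
// pixels in the rectangle from (0, 0) to (x, y), inclusive.
std::vector<float> integral_image(const std::vector<float>& gray,
                                  std::size_t width, std::size_t height) {
    assert(gray.size() == width * height);
    std::vector<float> sat(gray.size());
    for (std::size_t y = 0; y < height; ++y) {
        float row_sum = 0.0f;  // running sum of the current row
        for (std::size_t x = 0; x < width; ++x) {
            row_sum += gray[y * width + x];
            const float above = (y > 0) ? sat[(y - 1) * width + x] : 0.0f;
            sat[y * width + x] = row_sum + above;
        }
    }
    return sat;
}

// Sum of the pixels in the inclusive rectangle [x0, x1] x [y0, y1],
// using at most four table lookups regardless of the box size.
float box_sum(const std::vector<float>& sat, std::size_t width,
              std::size_t x0, std::size_t y0,
              std::size_t x1, std::size_t y1) {
    const float a = (x0 > 0 && y0 > 0) ? sat[(y0 - 1) * width + (x0 - 1)] : 0.0f;
    const float b = (y0 > 0) ? sat[(y0 - 1) * width + x1] : 0.0f;
    const float c = (x0 > 0) ? sat[y1 * width + (x0 - 1)] : 0.0f;
    const float d = sat[y1 * width + x1];
    return d - b - c + a;
}
```

The constant-time box_sum is what makes box-filter responses cheap at every scale. The row-wise running sum in integral_image is a prefix sum, which is the kind of "scan" primitive CUDPP provides on the GPU.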


When I tried to compile the CUDPP library, it was so slow that I thought the computer had died. After waiting for tens of minutes it finally completed. Out of curiosity I tried to figure out the reason and found an answer on the CUDPP site.[3]



"Compile time continues to get longer as we add more functionality.  CUDA is really slow at 
compiling template functions with multiple parameters, and we use a lot. There are something like 
384 different scan kernels, for example, and a similar number for segscan. "





Reference
[1] CUDPP, http://code.google.com/p/cudpp/
[2] CUDA SURF, http://www.d2.mpi-inf.mpg.de/surf
[3] CUDPP issue 19, http://code.google.com/p/cudpp/issues/detail?id=19