Wednesday, March 28, 2012

Preliminary Comparison of OpenSURF and CUDA SURF

OpenSURF[1] is an implementation of the SURF feature detector, descriptor and matcher in C++/C#. CUDA SURF[2] is a CUDA implementation of OpenSURF that uses the CUDA SDK and CUDPP. Both use OpenCV for basic image operations. CUDA SURF exposes exactly the same function interface as OpenSURF, so the two make a reasonable pair for a performance comparison.


Here's a brief test of the SURF algorithm on CPU vs. GPU on the same computer (Intel Xeon 3.60GHz / 4GB / Nvidia Quadro FX 5800 / Ubuntu 11.04 32-bit). The input images are shown.
Test Images from OpenSURF[1]


The preliminary test shows that both implementations produce good and similar results, but the CPU-based OpenSURF (0.65s) is about 2.6x faster than the GPU-based CUDA SURF (1.72s). I was quite surprised at first, so I added timing probes to find where the two implementations differ. It turns out that CUDA SURF spends almost all of its time in the initial device memory allocation (1.68s in the log below), most likely because the first CUDA runtime call also has to set up the CUDA context. The rest of the pipeline is far faster than OpenSURF. That makes it potentially feasible for real-time processing, since the initialization is only needed once. I will test and discuss more details later, and I will try to optimize CUDA SURF on this specific computer.
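To put rough numbers on the "initialize once" argument, here is a minimal sketch. It assumes the one-time setup costs about 1.68s, that everything else in the GPU run (1.72s - 1.68s = 0.04s) repeats on every run, and that the CPU run costs 0.65s each time. The function name breakEvenRuns is made up for this illustration, and with a timer resolution of only 0.01s the 0.04s figure is very approximate.

```cpp
#include <cassert>
#include <cmath>

// Smallest number of runs N for which
//     init_s + N * gpu_run_s < N * cpu_run_s
// i.e. the point where paying the one-time GPU setup starts to pay off.
// Returns -1 if the GPU path is not faster per run and never catches up.
int breakEvenRuns(double init_s, double gpu_run_s, double cpu_run_s) {
    assert(init_s >= 0.0 && gpu_run_s >= 0.0 && cpu_run_s >= 0.0);
    const double saving = cpu_run_s - gpu_run_s;  // time saved per run
    if (saving <= 0.0) return -1;
    // N must be strictly greater than init_s / saving.
    return static_cast<int>(std::floor(init_s / saving)) + 1;
}
```

With the figures from this post, breakEvenRuns(1.68, 0.04, 0.65) returns 3, so under these assumptions the GPU version would be ahead from the third image pair on.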


Code in the Time(2) section:


    // Allocate device memory
    int img_width = src->width;
    int img_height = src->height;
    size_t rgb_img_pitch, gray_img_pitch, int_img_pitch, int_img_tr_pitch;
    CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_rgb_img, &rgb_img_pitch, img_width * sizeof(unsigned int), img_height) );
    CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_gray_img, &gray_img_pitch, img_width * sizeof(float), img_height) );
    CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_int_img, &int_img_pitch, img_width * sizeof(float), img_height) );
    CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_int_img_tr, &int_img_tr_pitch, img_height * sizeof(float), img_width) );
    CUDA_SAFE_CALL( cudaMallocPitch((void**)&d_int_img_tr2, &int_img_tr_pitch, img_height * sizeof(float), img_width) );
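For readers unfamiliar with the pitch values above: cudaMallocPitch may pad every row, and it reports the resulting row stride in bytes. The host-side sketch below only illustrates the addressing arithmetic. The helper names pitchFor and pitchedOffset are made up, and the 512-byte alignment used in the note is an assumed value, since the real padding is chosen by the CUDA runtime.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a row width in bytes up to the next multiple of `alignment`.
// The actual alignment is decided by the CUDA runtime, not by the caller.
std::size_t pitchFor(std::size_t row_bytes, std::size_t alignment) {
    assert(alignment > 0);
    return (row_bytes + alignment - 1) / alignment * alignment;
}

// Byte offset of element (row, col) in a pitched 2D buffer.
// Rows are `pitch` bytes apart, which can be more than col_count * elem_size.
std::size_t pitchedOffset(std::size_t pitch, std::size_t row,
                          std::size_t col, std::size_t elem_size) {
    assert(elem_size > 0 && (col + 1) * elem_size <= pitch);
    return row * pitch + col * elem_size;
}
```

For example, a row of 641 floats is 2564 bytes, which pitchFor(2564, 512) rounds up to 3072, and element (2, 3) then sits at byte offset 2 * 3072 + 3 * 4 = 6156. A kernel has to index with the pitch, not with the image width.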


CPU-based OpenSURF(0.65s)
Matches: 76
Time(load):0.03000
Time(descriptor):0.56000
        Time(Integral):0.00000
        Time(FastHessian):0.00000
        Time(getIpoints):0.09000
        Time(descriptor):0.33000
        Time(cvReleaseImage):0.00000
        --------------------------------------
        Time(Integral):0.00000
        Time(FastHessian):0.00000
        Time(getIpoints):0.03000
        Time(descriptor):0.11000
        Time(cvReleaseImage):0.00000
Time(match):0.02000
Time(plot):0.00000
Time(save):0.04000


GPU-based CUDA SURF(1.72s)
Matches: 66
Time(load):0.02000
Time(descriptor):1.69000
        Time(Integral):1.68000
                Time(1):0.0000000000
                Time(2):1.6800000000
                Time(3):0.0000000000
                Time(4):0.0000000000
                Time(5):0.0000000000
                Time(6):0.0000000000
                Time(7):0.0000000000
                Time(8):0.0000000000
        Time(FastHessian):0.00000
        Time(getIpoints):0.00000
        Time(descriptor):0.00000
        Time(freeCudaImage):0.00000
        --------------------------------------
        Time(Integral):0.00000
                Time(1):0.0000000000
                Time(2):0.0000000000
                Time(3):0.0000000000
                Time(4):0.0000000000
                Time(5):0.0000000000
                Time(6):0.0000000000
                Time(7):0.0000000000
                Time(8):0.0000000000
        Time(FastHessian):0.00000
        Time(getIpoints):0.01000
        Time(descriptor):0.00000
        Time(freeCudaImage):0.00000
Time(match):0.01000
Time(plot):0.00000
Time(save):0.03000

             


CPU-based OpenSURF(0.65s)




GPU-based CUDA SURF(1.72s)

BTW, there may be a better way of timing that would improve the accuracy.[3]
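As one possible direction, the logs above only resolve 0.01s, so most stages read as 0.00000. A minimal sketch of a finer-grained probe, assuming a C++11 compiler is available (the helper name meanSeconds is made up for this illustration), averages many repetitions with a monotonic clock:

```cpp
#include <cassert>
#include <chrono>

// Calls `fn` `repeats` times and returns the mean wall-clock time
// per call in seconds, measured with a monotonic clock.
template <typename Fn>
double meanSeconds(Fn&& fn, int repeats) {
    assert(repeats > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < repeats; ++i) {
        fn();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count() / repeats;
}
```

When the timed function launches GPU work, it should wait for the device (for example with cudaDeviceSynchronize) before returning, because kernel launches are asynchronous and the host clock would otherwise stop too early.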

             


Reference
[1] http://www.chrisevansdev.com/computer-vision-opensurf.html
[2] http://www.d2.mpi-inf.mpg.de/surf
[3] Measuring Computing Times and Operation Counts of Generic Algorithms, http://www.cs.rpi.edu/~musser/gp/timing.html

CUDPP

CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.[1]


Because SURF uses integral images for fast convolution, the summed-area table has to be computed on the GPU with CUDA. CUDPP is one way to do this, and it is a dependency of CUDA SURF.[2]
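The link between scan and summed-area tables can be sketched on the CPU: an inclusive prefix sum along every row followed by one along every column gives the table. This is only an illustration of the idea, not the CUDA SURF or CUDPP code, and the function name summedAreaTable is made up here.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Builds the summed-area table of a row-major width x height image:
// result[y * width + x] = sum of img over all (i, j) with i <= x and j <= y.
std::vector<float> summedAreaTable(const std::vector<float>& img,
                                   std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    std::vector<float> sat(img);
    // Pass 1: inclusive prefix sum ("scan") along every row.
    for (std::size_t y = 0; y < height; ++y) {
        auto row = sat.begin() + static_cast<std::ptrdiff_t>(y * width);
        std::partial_sum(row, row + static_cast<std::ptrdiff_t>(width), row);
    }
    // Pass 2: inclusive prefix sum along every column.
    for (std::size_t y = 1; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            sat[y * width + x] += sat[(y - 1) * width + x];
        }
    }
    return sat;
}
```

Each row scan in pass 1 is independent of the others, and so is each column scan in pass 2, which is why a parallel scan primitive fits this problem well.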


When I tried to compile the CUDPP library, it was so slow that I thought the computer had died. After tens of minutes it finally completed. Out of curiosity I looked for the reason and found an answer on the CUDPP project site.[3]



"Compile time continues to get longer as we add more functionality.  CUDA is really slow at 
compiling template functions with multiple parameters, and we use a lot. There are something like 
384 different scan kernels, for example, and a similar number for segscan. "





Reference
[1] CUDPP, http://code.google.com/p/cudpp/
[2] CUDA SURF, http://www.d2.mpi-inf.mpg.de/surf
[3] http://code.google.com/p/cudpp/issues/detail?id=19

libopencv-2.3.1

In ROS (Electric) the opencv2 package is deprecated. Instead, libopencv-dev 2.3.1 is used as the source of OpenCV. However, linking against it is troublesome: the flags obtained through pkg-config end up with -l prepended to the full paths of the libraries (2.3.1), which causes errors at link time. In addition, libopencv-dev 2.3.1 installs opencv-2.3.1.pc instead of opencv.pc in /usr/lib/pkgconfig/, which leads to the following error when compiling a regular C/C++ source file outside ROS, even when pkg-config is pointed at opencv-2.3.1.pc.


/usr/bin/ld: cannot find -l/usr/lib/libopencv_contrib.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_legacy.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_objdetect.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_calib3d.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_features2d.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_video.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_highgui.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_ml.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_imgproc.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_flann.so.2.3.1
/usr/bin/ld: cannot find -l/usr/lib/libopencv_core.so.2.3.1
collect2: ld returned 1 exit status
make: *** [DisplayImage] Error 1


To solve the problem, what worked on my computer was creating an additional .pc file at /usr/lib/pkgconfig/opencv.pc with the following contents.[1]


# Package Information for pkg-config
prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir_old=${prefix}/include/opencv-2.3.1/opencv
includedir_new=${prefix}/include/opencv-2.3.1
Name: OpenCV
Description: Open Source Computer Vision Library
Version: 2.3.1
Libs: -L${libdir} -lopencv_contrib -lopencv_legacy -lopencv_objdetect -lopencv_calib3d -lopencv_features2d -lopencv_video -lopencv_highgui -lopencv_ml -lopencv_imgproc -lopencv_flann -lopencv_core
Cflags: -I${includedir_old} -I${includedir_new}


Reference
[1] About compiling opencv programs outside ROS, http://answers.ros.org/question/11916/about-compiling-opencv-programs-outside-ros
[2] Ticket #1475, https://code.ros.org/trac/opencv/ticket/1475

Dropbox Uploader in terminal

Since I am working on shaggy remotely these days, I need a way to transfer code and files to shaggy, execute them there, and transfer the results back to my computer. For security reasons I have to ssh to eniac.seas.upenn.edu first and then ssh to shaggy, which makes the scp command annoying. There are a couple of ways to do it, and here's my experience.

1. github
This is the professional way, no doubt. Just clone the repository on shaggy, then push and pull.

2. scp
I need to scp to eniac first and then to shaggy.

3. Dropbox uploader
There is a bash tool called dropbox_uploader[1], which lets me upload files to my Dropbox with a single command line.


Usage: ./dropbox_uploader.sh [OPTIONS]...
Options:
        -u [USERNAME] (required if not hardcoded)
        -p [PASSWORD]
        -f [FILE/FOLDER] (required)
        -d [REMOTE_FOLDER] (default: /)
        -v Verbose mode

It's a convenient one-way channel for sending results back to my computer.

Tuesday, March 27, 2012

Update

Working on a paper about vision-based SLAM these days. Last week I tested (offline) the built-in SIFT GPU code on sequential images, and it significantly improved the speed. Now I am thinking of combining the SIFT/SURF feature extraction and matching in a single kernel, which would reduce the overhead of memory reads and writes. I am working on shaggy and need to install some dependencies.

Tuesday, March 13, 2012

Reading Digest::SURF

I am reading about the SURF feature detector/descriptor/matching these days. The major paper is by Herbert Bay et al., called SURF: Speeded Up Robust Features[1].

Speeded Up Robust Features (SURF) are scale-invariant and rotation-invariant features based on the Hessian matrix. Compared with SIFT and other point-feature approaches, SURF approximates or even outperforms them in repeatability, distinctiveness and robustness, yet can be computed and compared much faster[1].

The SURF method could be applied to sequential frames sampled by the Kinect RGB camera as well as the structured IR depth camera for global map registration and pose estimation. The scale-invariant and rotation-invariant properties fit the requirement of 3D reconstruction.

Following the paper[1], I found a few parts that could potentially be implemented in CUDA to improve real-time performance. So far they are:
Detector
(1) Integral image: SURF uses an integral image for convolution, in which each point stores the sum of all pixels of the input image inside the rectangle spanned by that point and the origin. Building it could be parallelized across multiple threads with CUDA.
(2) Filter parallelism: With an integral image, the cost of the approximated 2nd-order Gaussian derivative filters is independent of filter size, which makes it possible to apply filters of different scales to the image in parallel.
(3) Point-wise calculation: Convolution and other steps involve point-wise operations, which could be optimized by well-designed kernel functions using techniques such as memory coalescing and tiling. This idea should apply to the rest of the steps as well.
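Why the filter cost in (2) does not depend on filter size can be sketched with a small CPU example: once the integral image exists, the sum over any axis-aligned box takes four lookups. The IntegralImage struct and its zero-padded layout are assumptions made for this illustration (the padding just avoids special cases at the border) and are not taken from OpenSURF or CUDA SURF.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Integral image with one extra row and column of zeros:
// at(x, y) holds the sum of all source pixels with column < x and row < y.
struct IntegralImage {
    std::size_t width, height;   // size of the source image
    std::vector<double> data;    // (width + 1) * (height + 1) entries

    IntegralImage(const std::vector<float>& img, std::size_t w, std::size_t h)
        : width(w), height(h), data((w + 1) * (h + 1), 0.0) {
        assert(img.size() == w * h);
        for (std::size_t y = 0; y < h; ++y) {
            for (std::size_t x = 0; x < w; ++x) {
                // pixel + sum to the left + sum above - doubly counted corner
                data[(y + 1) * (w + 1) + (x + 1)] =
                    img[y * w + x] + at(x, y + 1) + at(x + 1, y) - at(x, y);
            }
        }
    }

    double at(std::size_t x, std::size_t y) const {
        return data[y * (width + 1) + x];
    }

    // Sum over columns [x0, x1) and rows [y0, y1): four lookups for any box size.
    double boxSum(std::size_t x0, std::size_t y0,
                  std::size_t x1, std::size_t y1) const {
        assert(x0 <= x1 && x1 <= width && y0 <= y1 && y1 <= height);
        return at(x1, y1) - at(x0, y1) - at(x1, y0) + at(x0, y0);
    }
};
```

Since every boxSum call is independent of the others, evaluating box filters at many positions and scales maps naturally onto one GPU thread per output value.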

Descriptor
more to read...

Matching
more to read...

Reference
[1] Herbert Bay, Tinne Tuytelaars, Luc Van Gool, SURF: Speeded Up Robust Features, 

Naive CPU-based RGB-D SLAM

Compiled and ran the RGB-D SLAM ROS packages[1] on my laptop again yesterday. The result looks cool but is pretty laggy for real-time use. Here's a short video of the result, showing the 3D point-cloud reconstruction of my dormitory.




Reference
[1] Felix Endres, Juergen Hess, Nikolas Engelhard, http://www.ros.org/wiki/rgbdslam

CUDA-based Algorithm Resources

Installed ROS today. Found some resources on CUDA implementations of SIFT/SURF/ICP. These may be good references and pretty helpful for my project.

GPU SIFT: http://cs.unc.edu/~ccwu/siftgpu/
CUDA SURF: http://www.d2.mpi-inf.mpg.de/surf?q=surf
GPU ICP: http://home.hiroshima-u.ac.jp/tamaki/study/cuda_softassign_emicp/

Monday, March 12, 2012

Project Proposal


CIS 565 Final Project Pitch
GPU Accelerated RGB-D SLAM with Microsoft Kinect
Yedong Niu
03/11/2012



Background
The simultaneous localization and mapping (SLAM) problem asks if it is possible for a mobile robot to be placed at an unknown location in an unknown environment and for the robot to incrementally build a consistent map of this environment while simultaneously determining its location within this map[1]. Vision-based SLAM is one of the most recent approaches in the SLAM community, and the RGB-D(epth) method with the affordable Microsoft Kinect sensor is a typical implementation. Real-time operation is a challenge, as it has to process a huge amount of point data with the limited hardware resources of a mobile robot. A GPU implementation may help with this problem.

Goal
My project aims to improve the performance of real-time 3D environment reconstruction[5] with Kinect by using CUDA. I will mainly focus on improving the efficiency of the related computer vision algorithms, including registration, feature extraction and matching (SIFT/SURF and RANSAC), and the Iterative Closest Point (ICP) algorithm. My project will be based on N. Engelhard’s paper[2] and will apply GPU acceleration to every appropriate step. The model takes 2 seconds per frame on an Intel i7 @ 2GHz[2], which is my starting baseline. I may use some OpenCV[3] and PCL[4] GPU libraries if allowed.

The above 3D reconstruction is based on point clouds. An optional goal is to reconstruct the environment with geometry-based surfaces, which is more challenging but more rewarding in some applications such as virtual touch input[6]. As time is limited, I am not sure whether I will reach this goal.

6D SLAM with RGB-D Data from Kinect [5]

KinectFusion[6]

Reference
[1] Hugh Durrant-Whyte, Tim Bailey, Simultaneous Localization and Mapping: Part I, 2006
[2] N. Engelhard, F. Endres et al., Real-time 3D visual SLAM with a hand-held RGB-D camera, 2011
[3] OpenCV GPU documentation 2.3, http://opencv.itseez.com/modules/gpu/doc/gpu.html, 2012
[4] PCL documentation, http://pointclouds.org/documentation/, 2012
[5] N. Engelhard, http://www.youtube.com/watch?v=XejNctt2Fcs, 6D SLAM with RGB-D Data from Kinect
[6] Shahram Izadi et al., KinectFusion: Realtime 3D Reconstruction and Interaction Using a Moving Depth Camera, pp563, 2011



Saturday, March 10, 2012

Just a start

This is the official blog for my CIS 565 final project. The topic is GPU-accelerated Kinect applications. The project aims to improve the performance of applications such as RGB-D SLAM, gesture recognition and real-time rendering. One specific topic will be chosen in the following days.

During the spring break I looked into GPU-based OpenCV and PCL libraries. They seem to be powerful for computer vision algorithms such as registration, feature extraction and matching. There is more to read before I decide on the roadmap of the project.