Monday, June 15, 2015

HOG People Tracker

This past semester I took a course in Computer Vision. The course culminated in a final project of our choosing, so I worked with my classmates, Luke Gessler and Minh Nguyen, to create a person-tracker using the Histogram of Oriented Gradients (HOG) descriptor.

In a nutshell, an image descriptor compresses an image into its most meaningful parts (its "features"), usually in the form of a vector. The HOG descriptor is a specific type of image descriptor that works very well for succinctly describing the shape of humans. This blog post has a good description of how the HOG descriptor works.
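To make the idea a bit more concrete, here is a minimal NumPy sketch of the core of HOG: a histogram of gradient orientations for a single cell, weighted by gradient magnitude. This is not the code from our project; the function name `cell_orientation_histogram` and the 9-bin, 8x8-cell setup are my own illustrative choices, and a full HOG descriptor additionally normalizes histograms over blocks of cells and concatenates them into one long vector.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations for one grayscale cell.

    Hypothetical helper for illustration; real HOG also normalizes over blocks.
    """
    cell = cell.astype(float)
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    gy, gx = np.gradient(cell)
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees.
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    # Each pixel votes for its orientation bin, weighted by gradient magnitude.
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    return hist

# A vertical edge: dark on the left, bright on the right.
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
hist = cell_orientation_histogram(cell)
```

For this vertical edge, all of the gradient energy lands in the first bin (orientations near 0 degrees), which is the kind of compact shape summary the descriptor is built from.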

Here's an example output video from our code:

For more in-depth information about what we did, check out:

Our final write-up

Slides from our presentation

Our code repo

Wednesday, December 10, 2014

Hyperlapse Video With Python

For my final project in Computational Photography, I wrote software that emulates the Hyperlapse video effect found in Instagram's Hyperlapse app.

For those who are unaware, a Hyperlapse video is essentially a time-lapse shot with a moving camera. Typically, it is created by combining image stabilization with frame subsampling. Below is an example of a Hyperlapse created with Instagram's app:

I did some digging to find out how these videos are created, and I found that Instagram uses the gyroscope built into the iPhone for motion stabilization. As each frame is captured, it is translated by an amount and in a direction opposite to the motion of the camera, creating synthetically smooth motion. In addition to the stabilization, the application crops, or in some cases stretches, each frame so that no black borders are visible from the translation.
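The translate-then-crop idea can be sketched on a plain NumPy array. This is only an illustration of the concept, not Instagram's implementation: the helper `shift_and_crop`, the integer pixel shifts, and the fixed `margin` are all assumptions of mine.

```python
import numpy as np

def shift_and_crop(frame, dx, dy, margin):
    """Translate a frame by integer (dx, dy) pixels, then crop `margin` pixels per side.

    Hypothetical helper. Assumes |dx| and |dy| are at most `margin`, so the
    uncovered (black) border is always removed by the crop.
    """
    h, w = frame.shape[:2]
    shifted = np.zeros_like(frame)
    # Windows describing where the shifted frame overlaps the original frame.
    dst_y = slice(max(dy, 0), h + min(dy, 0))
    dst_x = slice(max(dx, 0), w + min(dx, 0))
    src_y = slice(max(-dy, 0), h + min(-dy, 0))
    src_x = slice(max(-dx, 0), w + min(-dx, 0))
    shifted[dst_y, dst_x] = frame[src_y, src_x]
    return shifted[margin:h - margin, margin:w - margin]

# Suppose the camera moved by (3, -2) pixels; shift the frame the opposite way.
frame = np.arange(100).reshape(10, 10)
stabilized = shift_and_crop(frame, dx=-3, dy=2, margin=3)
```

The crop is what costs resolution: the larger the shifts you need to absorb, the larger the margin and the smaller the output frame.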

Microsoft Research created a similar piece of software, but instead of using a gyroscope, they processed the video based on information from the image frames. Using a computation-heavy process of reconstructing a 3D point cloud of the entire scene, calculating an optimally smooth camera path, and then re-rendering the synthetic frames, they were able to output Hyperlapse-esque video.


Having neither the resources of Microsoft nor an iPhone, I decided to try making a Hyperlapse effect with Python and OpenCV. In general, my algorithm was the following:
  1. Calculate motion between each video frame with sparse optical flow, then integrate the motion in the time-domain to calculate an estimated position of each frame at each time step
  2. Filter the position data with a low-pass filter to smooth the motion
  3. Subsample frames to speed up the footage
  4. Filter the subsampled position data with a low-pass filter again to remove high-frequency data that was added from subsampling
  5. Calculate how much to crop the video in order to remove the black borders from frame translations (if too much is cropped, overlay the frames)
  6. Translate and render the stabilized and subsampled frames
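Steps 1 through 5 can be sketched with NumPy alone if we assume the per-frame motion has already been measured (in my project that came from optical flow, which is left out here). This is a simplified illustration rather than the code in the repo: the moving-average filter stands in for whichever low-pass filter you prefer, and the names `moving_average`, `stabilizing_offsets`, `speedup`, and `window` are made up for this example.

```python
import numpy as np

def moving_average(path, window):
    """Low-pass filter an (N, 2) path with a centered moving average (odd window)."""
    pad = window // 2
    # Repeat the end values so the filtered path keeps the input length.
    padded = np.pad(path, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.column_stack(
        [np.convolve(padded[:, i], kernel, mode="valid") for i in range(path.shape[1])]
    )

def stabilizing_offsets(frame_motion, speedup=4, window=9):
    """Return kept frame indices, per-frame offsets, and a crop margin in pixels."""
    # Step 1: integrate frame-to-frame motion into a position for each frame.
    positions = np.cumsum(frame_motion, axis=0)
    # Step 2: smooth the position data.
    smooth = moving_average(positions, window)
    # Step 3: keep every `speedup`-th frame.
    kept = np.arange(0, len(positions), speedup)
    # Step 4: smooth again to remove jitter introduced by subsampling.
    target = moving_average(smooth[kept], window)
    # Offset that moves each kept frame from its measured position to the smooth path.
    offsets = target - positions[kept]
    # Step 5: the crop must cover the largest offset in each direction.
    margin = np.ceil(np.abs(offsets).max(axis=0)).astype(int)
    return kept, offsets, margin

# Synthetic shaky pan to the right: 2 px/frame of drift plus random jitter.
rng = np.random.default_rng(0)
frame_motion = np.array([2.0, 0.0]) + rng.normal(0.0, 1.5, size=(200, 2))
kept, offsets, margin = stabilizing_offsets(frame_motion, speedup=4, window=9)
```

Whether an offset should be added or subtracted when translating a frame depends on how the motion was measured, so treat the sign here as a convention of the sketch.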
While not perfect, it ended up working pretty well! You can download the code for this from GitHub:

Below are some videos I rendered with the program:

Friday, October 24, 2014

Gradient Domain Cloning

I'm taking Computational Photography this semester and we recently had a homework assignment to implement Gradient Domain Cloning. Gradient Domain Cloning is a technique for blending together two images so that they fit together seamlessly.

In the past, the naive approach to doing this would be to identify two images with similar backgrounds and simply paste one on top of the other. Using Wikipedia's creepy example of an eye on a hand, this is what would happen:


The result is okay, but generally not great. It turns out, however, that if we analyze the images in the gradient domain (the gradient part of Gradient Domain Cloning), we can automate this process!

What is the gradient domain, you may ask? It can mean several things depending on what field you're in, but for our purposes, it means quantifying the change in pixel values in the X and Y directions. If we consider one color channel at a time, we can calculate the slope for that channel in the X-direction by finding the difference in color values between the pixels to the left and right of the current pixel, and then dividing it by 2 (the distance between those pixels). We can do the same with the pixels above and below the current pixel to determine the slope in the Y-direction.

(editor's note: what I'm describing is a discrete approximation of the gradient, in the context of image processing. In mathematics, the gradient is a well-defined concept which Wikipedia has a great article on)
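The left/right and above/below differences described above translate almost directly into NumPy slicing. This is a small sketch rather than my homework code; the name `central_gradient` is made up, and leaving the border pixels at zero is just one simple way to handle pixels that are missing a neighbor.

```python
import numpy as np

def central_gradient(image):
    """Central-difference slopes in X and Y for an (H, W, C) image.

    Border pixels are left at zero because they lack one of the two neighbors.
    """
    image = image.astype(float)
    gx = np.zeros_like(image)
    gy = np.zeros_like(image)
    # (right neighbor - left neighbor) / 2, for every channel at once.
    gx[:, 1:-1] = (image[:, 2:] - image[:, :-2]) / 2.0
    # (pixel below - pixel above) / 2.
    gy[1:-1, :] = (image[2:, :] - image[:-2, :]) / 2.0
    return gx, gy

# A 4x5 RGB image that brightens by 10 per pixel from left to right.
ramp = np.tile(np.arange(5, dtype=float) * 10, (4, 1))[:, :, None].repeat(3, axis=2)
gx, gy = central_gradient(ramp)
```

On this ramp, every interior pixel has an X slope of 10 and a Y slope of 0 in all three channels.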

Performing this gradient operation on an image and then visualizing the output gives us something similar to edge-detection filters:

(source: Wikipedia)

Now, what would happen if we performed our naive approach from above, but this time with the gradient of each image? Wikipedia comes to our rescue with another creepy eye/hand picture to illustrate the result (Apple: you should patent this!):

(source: Wikipedia)

This brings us one step closer to our goal of automated blending. Since we chose our two source images to have similar backgrounds (e.g. the eye and hand have the same skin tone), unsurprisingly, their gradient images should fit together nicely without any need for blending. The next step, then, would be to apply a magical mathematical operation to convert the combined image from the gradient domain to the original image domain.

Doing this is not so simple because we can't just do the functional inverse of taking a gradient (i.e. integrating in 2D); if we did, we would get back the naive approach's result. We need to take into account the colors in the foreground and background images so that they both match at the edge between the two images.

This paper goes into significant depth on how the solution is derived, but suffice it to say, we can reduce the problem to solving a linear system of equations derived from Poisson's equation (in fact, Prof. Barnes simplified this to equation #7 on the homework assignment page). Once in this form, we can use SciPy's built-in solver for linear systems to solve for the blended image. Below is the result for our eye/hand example:
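For a feel of what that linear system looks like, here is a tiny single-channel sketch. It uses a common discrete form of the Poisson equation (each masked pixel times 4, minus its four neighbors, equals the same expression computed on the foreground), which may not match equation #7 term for term. It also swaps in a dense `np.linalg.solve` for the sparse SciPy solver, which is only sensible for toy-sized masks, and the function name `poisson_clone` is made up for this example.

```python
import numpy as np

def poisson_clone(fg, bg, mask):
    """Blend single-channel `fg` into `bg` wherever `mask` is True.

    Illustrative sketch: assumes the mask does not touch the image border.
    """
    fg = fg.astype(float)
    bg = bg.astype(float)
    ys, xs = np.nonzero(mask)
    # Give every masked pixel a row/column in the linear system.
    index = {p: i for i, p in enumerate(zip(ys.tolist(), xs.tolist()))}
    n = len(index)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for (y, x), i in index.items():
        A[i, i] = 4.0
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            qy, qx = y + dy, x + dx
            # Ask the result to keep the foreground's gradient toward neighbor q.
            b[i] += fg[y, x] - fg[qy, qx]
            if (qy, qx) in index:
                # Neighbor is also unknown: it goes on the left-hand side.
                A[i, index[(qy, qx)]] = -1.0
            else:
                # Neighbor lies outside the mask: pin it to the background value.
                b[i] += bg[qy, qx]
    result = bg.copy()
    result[ys, xs] = np.linalg.solve(A, b)
    return result

# Foreground has the same texture as the background but is 50 levels brighter.
bg = np.add.outer(np.arange(6.0), np.arange(6.0))
fg = bg + 50.0
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 2:4] = True
blended = poisson_clone(fg, bg, mask)
```

In this toy case the brightness offset disappears entirely and `blended` matches the background, because the foreground's gradients are kept while its absolute colors are tied to the background at the boundary. For a color image you would run this once per channel.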

(source: Wikipedia)

I've posted my code for this project on GitHub; the project is written in Python.

To create the image at the top of this post, use ron2.jpg as the foreground image, Mona_Lisa.jpg as the background image, and ron_matte.png to define the boundaries of the foreground image. All of these are found in the "imgs" directory on the GitHub page.

Happy cloning!