CSE576 Final Project
Motion Tracking

Author: Jeff Cole
Date: 6/02/06

Report Contents:
  1. Overall design of system
  2. Methodology for Tracking
  3. Algorithm Description
  4. Experiments and Results
  5. Conclusions
  6. Source Code
  7. References

Overall design of system

The goal of this assignment was to design and implement a system that tracks moving vehicles in videos. Video tracking can be accomplished in a variety of ways, including optical flow clustering, template matching, and interest operators. For this assignment I chose to use an interest operator to describe and search for the objects I am tracking. I chose the SIFT operator because it is rotation invariant and because it has been shown to perform extremely well with large databases of descriptors.

Methodology for Tracking

My general methodology is to detect SIFT features for objects and then repeatedly update the SIFT descriptors throughout the course of the video sequence. First, the user initializes the tracking by drawing a box around any object that s/he wants to track. The system then detects and saves all the SIFT features within that user-selected region. Then the system begins processing all of the successive frames in the video sequence. In each new frame, it looks within a padded region around the object's location in the previous frame to locate the object in the new frame. The key here is that the padding parameter must be set large enough that the object never moves farther than the padding width between two successive frames. The padding parameter is a free parameter that must be set by the user. For all of my experiments I set the padding parameter to between 9 and 12 pixels.
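The padded search described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB used for the project. The box layout, the function names `padded_region` and `keypoints_in_region`, and the clamping to the frame borders are all assumptions made for the example, not details taken from the original code.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def padded_region(box: Box, pad: int, width: int, height: int) -> Box:
    """Grow `box` by `pad` pixels on every side, clamped to the frame borders."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(width - 1, x1 + pad), min(height - 1, y1 + pad))


def keypoints_in_region(keypoints: List[Tuple[float, float]], region: Box) -> List[int]:
    """Return the indices of keypoints whose (x, y) location lies inside `region`."""
    x0, y0, x1, y1 = region
    return [i for i, (x, y) in enumerate(keypoints)
            if x0 <= x <= x1 and y0 <= y <= y1]


# Example: a box near the left edge of a 320x240 frame, padded by 10 pixels.
search = padded_region((5, 100, 45, 120), pad=10, width=320, height=240)
print(search)  # (0, 90, 55, 130)
```

Only keypoints that fall inside the returned region would be compared against the stored descriptors, which is what keeps the per-frame search small. If the object moves more than `pad` pixels between frames, it can leave this region, which is why the padding must be chosen generously.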

Since objects in videos are likely to change appearance during the course of the video (due to motion, rotation, or lighting variations), I decided to include a recalibration stage at regular intervals throughout the video: after every N frames, the system redefines which SIFT features describe the object it is tracking. The update-rate parameter must be chosen by the user. For all of the sequences I tested in this project, the system was set to recalibrate every 30 frames.

I also added an emergency-situation detector. When no features are found in the current frame, it tells the system to perform an emergency recalibration as soon as a descriptor is found in any subsequent frame. This emergency update becomes necessary when an object undergoes a very rapid change in appearance.
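The interaction between the scheduled and emergency recalibrations can be sketched as a small scheduling loop. This is a hedged illustration in Python, not the project's MATLAB code. The function name `recalibration_plan` is hypothetical, and two details are assumptions: that the schedule is counted from the start of the video (rather than from the last recalibration), and that a scheduled recalibration is skipped when no features are visible.

```python
from typing import List


def recalibration_plan(match_counts: List[int], update_rate: int = 30) -> List[int]:
    """Return the frame indices at which the tracker would recalibrate.

    match_counts[i] is the number of SIFT features matched in frame i.
    Frame 0 is the user-initialized frame and is never recalibrated.
    """
    recalibrated = []
    emergency_pending = False
    for frame, n_matched in enumerate(match_counts):
        if frame == 0:
            continue
        if n_matched == 0:
            # Object lost: nothing to relearn from, so wait for a later frame.
            emergency_pending = True
            continue
        # Recalibrate on schedule, or as soon as features reappear after a loss.
        if emergency_pending or frame % update_rate == 0:
            recalibrated.append(frame)
            emergency_pending = False
    return recalibrated


# Short example with update_rate=4: the object is lost in frames 3 and 4.
counts = [10, 9, 8, 0, 0, 6, 7, 7, 9]
print(recalibration_plan(counts, update_rate=4))  # [5, 8]
```

In the example, the scheduled recalibration at frame 4 cannot happen because no features are visible, so the emergency recalibration fires at frame 5 when features reappear, and the regular schedule resumes at frame 8.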

Algorithm Description

Experiments and Results

To test my method, I processed a number of video sequences of cars and tanks on runways and on dusty desert roads. The first few videos are the easiest to track, and they become more and more difficult as you go further down the page. Each tracked vehicle is outlined with a uniquely colored box, and each detected SIFT feature is marked with an 'x'. Any time an object's set of descriptors is recalibrated, an orange circle blinks over the object as a visual cue that recalibration has occurred. Click on any of the images below to watch the tracking results.

Tracking 019
This video sequence consists of two cars that are stationary on a runway. The aerial view has some motion and jitter, but the view does not change much during the sequence, making it rather easy to track.

Tracking 027
This video sequence consists of four parked cars on a runway. Like the previous sequence, the view does not change significantly and the appearance of each car remains fairly constant, making tracking rather easy. There is also a person walking from one car to another during this sequence, but my method was unable to track the person (even though I initialized it to try). It does a poor job of tracking the person mostly because he is so small. When an object only takes up a couple of pixels, it is rather difficult to generate a useful descriptor.

Tracking 032
This video sequence consists of a dark green truck driving on a dirt path with very low shrubbery. The truck travels for a few hundred meters during the sequence, but the person filming did a good job of holding the view of the truck steady. Also, the truck does not make any turns so its appearance remains very constant throughout the sequence. These factors all make it quite easy for my SIFT matching method to follow the truck.

Tracking 045
For this sequence, two trucks are parked on a dirt road. There is minimal camera movement, making this another relatively easy tracking task. However, if we look closely at the tracking result, we see a potential weakness of my proposed method. Notice that during the recalibration stage (every time you see an orange circle blink) the detected object region grows slightly. By the time the full sequence is processed, my tracker has started to identify the road and bushes near the truck as being part of the truck. Since the method is constantly relearning what an object looks like, it can come to treat interesting points right next to the object as part of the object. Essentially, the biggest strength of my algorithm (its ability to relearn an object's appearance) can also be its biggest weakness.

Tracking 059
This video sequence consists of five cars and a motorcycle. We see here that my proposed tracking method works well even when there are a number of similar-looking objects to track. Even the motorcycle rider is tracked for half of the sequence. He is not tracked some of the time because he is so small that the texture of the road behind him is included in the SIFT features that describe him. So when he rides over the white paint on the runway, the tracker loses him.

Tracking 017
This video sequence consists of a car passing a truck on a highway. My tracking method performs quite well for this sequence and recovers nicely when the car temporarily goes behind some trees. Originally I was worried that this sequence would confuse my method, because the car ends up partially blocking the view of the truck. My method could have relearned the features of the car and truck precisely at the moment they overlap, in which case the tracker would treat them as a single object. But the prominent SIFT features for the car and truck are separated enough spatially that the algorithm does not get thrown off. I suspect that if the car in front were spatially surrounded by the truck during a recalibration frame, the tracker would relearn the two vehicles as a single object.

Tracking Tank-EO
This video sequence consists of a single tank moving quickly along a dirt road. For many methods, this sequence is quite difficult to track because the camera work is very shaky, the background has a lot of distracting clutter, and the tank is constantly turning and thus changing its appearance. But my algorithm is designed specifically with these types of issues in mind. By continually recalibrating the SIFT features that describe the tank, the system is able to track the tank through all of the twists and turns it takes.

Tracking Vehicles-people-EO
This video sequence shows a tank moving around in a desert. It is extremely difficult to track because the camera is very shaky, it zooms in and out, it often loses sight of the tank, and dust clouds often engulf the tank and hide it from view. This is the only sequence I tested where my method failed to track the target for the entire sequence. However, it still performed extremely well for the first two-thirds of the video, recovering gracefully twice when the cameraman temporarily loses sight of the tank. It even successfully tracks the tank when it makes a sudden stop and a dust cloud engulfs it and blocks it from view. But when the cameraman changes the zoom drastically while the tank is outside the field of view, the program cannot correctly update its descriptors of the tank and is unable to recover.


Conclusions

Overall, my system performs very well on the videos I analyzed. By frequently recalibrating the descriptors for each object, the system is able to adapt to the changes that naturally occur in video sequences. As the experiments above show, the system does a very good job of tracking multiple objects in cluttered environments despite poor camera work and poor video quality. The biggest issues for my system came when objects changed appearance too quickly or changed appearance while they were off the screen.

One of the biggest benefits of my method is its simplicity. There are very few steps required for the algorithm to track each object. And since the search region is limited for each object, the processing time for a single frame is quite modest. On a 1.7GHz processor running MATLAB, my code can process about 2-5 frames per second (depending on the size and number of objects to track). If implemented in a faster language such as C or Java, my method could probably be used for real-time tracking.

One drawback of my system is that it has two free parameters that must be set by the user (padding and recalibration rate). However, with more time to explore my methodology, I believe that both of these parameters could be either removed or set to adapt automatically while the video sequence is analyzed. For example, the update rate could probably be eliminated completely; instead, recalibration could be triggered whenever the number of detected features falls below some threshold. Due to time constraints I was not able to explore these ideas, but they would be good directions for future work.
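The threshold-triggered recalibration proposed above was not implemented in this project, but the idea can be sketched briefly. The following Python fragment is only an illustration. The function name `needs_recalibration`, the choice to express the threshold as a fraction of the features stored at the last calibration, and the default of 0.5 are all assumptions made for the example.

```python
def needs_recalibration(n_matched: int, n_reference: int, min_fraction: float = 0.5) -> bool:
    """Return True when too few of the stored features are still being found.

    n_reference is the number of features saved at the last calibration and
    n_matched is the number of those features found in the current frame.
    """
    if n_reference <= 0:
        # No stored description at all: a calibration is always needed.
        return True
    return n_matched < min_fraction * n_reference


# With 20 stored features and a 50% threshold, recalibration fires below 10 matches.
print(needs_recalibration(11, 20))  # False
print(needs_recalibration(9, 20))   # True
```

A trigger of this kind would replace the fixed update rate with a single threshold, so it reduces the tuning burden rather than removing it. The case of zero matches would still need the emergency handling described earlier, because there is nothing to relearn from in such a frame.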

Another drawback of my system is that it requires the user to hand-initialize the locations of the objects to track. An important next step in the development of this methodology would be to work on ways of automatically detecting interesting objects in the frame (although this is an extremely difficult task, if not an impossible one). Then the user would not have to initialize the tracking, and when objects are lost by the tracker, they could be automatically reinitialized by the system.

Source Code

I implemented this algorithm as a MATLAB script that uses the freely available SIFT feature detection/generation code from David Lowe's website (http://www.cs.ubc.ca/~lowe/keypoints/).

Presentation (5 minutes)



References

David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.

Iryna Gordon and David G. Lowe, "Scene modelling, recognition and tracking with invariant image features," International Symposium on Mixed and Augmented Reality (ISMAR), Arlington, VA (Nov. 2004), pp. 110-119.

Stephen Se, David G. Lowe and Jim Little, "Global localization using distinctive visual features," International Conference on Intelligent Robots and Systems (IROS), Lausanne, Switzerland (2002), pp. 226-231.

Shaohua Kevin Zhou, Rama Chellappa and Baback Moghaddam, "Visual tracking and recognition using appearance-adaptive models in particle filters," IEEE Transactions on Image Processing (2004), pp. 1491-1506.