An integral part of the VAMPIRE system is the visual tracking component. Visual tracking divides into two main tasks:
Object Tracking is directly connected to the action recognition and to the object recognition and learning subsystems. The visual object tracker provides the object learning subsystem with images of the segmented objects of interest, while the action classifier receives trails of the objects' positions and recognises certain user actions.
Which regions in an image sequence are worth tracking? In VAMPIRE, tracking of objects is initiated from object recognition results. Visual tracking of the object region allows a trajectory to be computed and fed into the action recognition modules, and different views of an object to be acquired for learning new objects. Several approaches coping with different requirements are studied in VAMPIRE.
Colour based object tracking
The object tracking system of VAMPIRE has to fulfil several requirements. It has to
be real-time capable (i.e. process at least 15 frames per second),
be robust against illumination changes, occlusions, and other kinds of appearance changes,
tolerate moving cameras,
cope with strong object movement in the image plane,
be completely data-driven (i.e. work without model knowledge of the object).
Several different object tracking approaches were investigated. Colour-histogram-based techniques in particular proved to be robust and accurate. For the VAMPIRE system, we use HS-V colour histogram features: pixels that are weakly saturated are accumulated in a V histogram, while the others are inserted into a joint HS histogram. An example of HS-V histograms is given in the following illustration.
Histogram-based tracking algorithms detect the region whose colour histogram is most similar to the object's colour histogram. This is an optimisation problem, and different optimisation techniques were evaluated. One of them is a probabilistic approach that applies a particle filter; this method proved to be very robust even in the case of strong object movements. Methods based on local optimisation techniques (e.g. the mean-shift algorithm) turned out to be computationally efficient (less than 9 milliseconds per frame) and very accurate. In the video below, the VAMPIRE helmet was tracked by the colour histogram-based particle filter.
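As a rough illustration of this feature, the HS-V histogram construction and a common histogram similarity measure can be sketched in a few lines of numpy. The bin counts, the saturation threshold, and the use of the Bhattacharyya coefficient as the similarity measure are illustrative assumptions, not the exact VAMPIRE parameters:

```python
import numpy as np

def hsv_histogram(hsv_pixels, sat_threshold=0.1, bins=8):
    """Build an HS-V histogram: weakly saturated pixels go into a
    V histogram, all others into a joint HS histogram.
    `hsv_pixels` is an (N, 3) array with H, S, V in [0, 1]."""
    weak = hsv_pixels[:, 1] < sat_threshold
    v_hist, _ = np.histogram(hsv_pixels[weak, 2], bins=bins, range=(0, 1))
    hs_hist, _, _ = np.histogram2d(hsv_pixels[~weak, 0],
                                   hsv_pixels[~weak, 1],
                                   bins=bins, range=((0, 1), (0, 1)))
    hist = np.concatenate([hs_hist.ravel(), v_hist]).astype(float)
    return hist / max(hist.sum(), 1e-12)   # normalise to a distribution

def bhattacharyya_similarity(p, q):
    """Similarity between two normalised histograms (1 = identical)."""
    return float(np.sum(np.sqrt(p * q)))
```

A candidate region is then scored by the similarity between its histogram and the object's reference histogram, and the optimiser (particle filter or mean shift) searches for the highest-scoring region.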
It is not only important to know the 2-D coordinates of the object in the image plane; 3-D information about the object's location is also a vital contribution to the whole system. For this, we investigated multi-ocular object tracking approaches. One approach provides very fast and accurate 3-D estimates, with an average error of less than one centimetre in the conducted experiments.
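The 3-D estimation step of a multi-ocular tracker boils down to triangulating the object position from its 2-D coordinates in two (or more) calibrated views. A minimal linear (DLT) triangulation sketch, assuming known 3x4 camera projection matrices, could look like this; it is a textbook illustration, not the VAMPIRE implementation:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3-D point from two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: 2-D image points."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)   # null vector of A is the homogeneous point
    X = vt[-1]
    return X[:3] / X[3]           # dehomogenise
```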
Several different algorithms for model-based tracking were evaluated. A probabilistic extension of the hyperplane tracker, which uses a large number of reference templates obtained in a training step, was implemented. Another approach, the 3-D hyperplane tracker, uses a 3-D model of the tracked object to estimate its position and orientation. The point features of the 3-D model are acquired with the scale-invariant feature transform (SIFT), which is also used for initialising the 3-D tracking algorithm. Finally, a tracker that combines a 3-D model of SIFT features with a data-driven feature point tracker for robust real-time pose estimation was developed. All of these approaches estimate all six pose parameters in real-time; the last one proved to be the most robust in our experimental evaluations. In the following video sequence, a package of juice was tracked with this approach. Red dots illustrate points which are tracked independently by the feature point tracker; small green dots represent reprojected model points.
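Pose estimation from 2-D/3-D point correspondences, as used by the model-based trackers above, is usually bootstrapped with a linear method. The following generic DLT sketch recovers a 3x4 projection matrix from n >= 6 correspondences; the six pose parameters can then be extracted from it, or refined iteratively. This is a standard textbook step, not the VAMPIRE code:

```python
import numpy as np

def estimate_projection_matrix(X3d, x2d):
    """DLT estimate of a 3x4 projection matrix from n >= 6
    3-D/2-D point correspondences (a common linear first step
    before refining the full 6-DOF pose)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(X3d, x2d):
        Xh = np.array([X, Y, Z, 1.0])
        rows.append([*Xh, 0, 0, 0, 0, *(-u * Xh)])
        rows.append([0, 0, 0, 0, *Xh, *(-v * Xh)])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 4)

def project(P, X):
    """Project a 3-D point with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```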
If the camera is at a fixed position (i.e. confined to rotating and zooming), the global motion in the image domain can be represented by a simpler geometric model - a homography. The tracked points from the low-level tracker can be used to determine the homography and hence to track the motion of the whole scene.
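Estimating such a homography from the tracked point correspondences is again a linear problem. A minimal DLT sketch (an illustration, not the production code) for n >= 4 correspondences:

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT estimate of the 3x3 homography mapping src -> dst
    (n >= 4 point correspondences, each an (x, y) pair)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the scale of the homogeneous solution

def apply_homography(H, pt):
    """Map a 2-D point through the homography H."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]
```

In practice the correspondences from the low-level tracker contain outliers, so the estimation would be wrapped in a robust scheme such as RANSAC.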
Tracking the ball in tennis video is a great challenge, because the ball is very fast compared to the usual 25 Hz refresh rate of common video material. The large distance the ball covers from one image to the next thus complicates visual tracking.
In foreground blob recognition, tennis ball candidates are detected, possibly with false positives and false negatives. Since a trajectory is a collection of ball candidates, tennis ball tracking can then be considered a search problem: find, among all possible combinations of candidates, the trajectory that minimises some cost function. Such a cost function can be constructed from the mismatch between predicted and observed positions of the object, i.e. the innovation. A simple one, for instance, would be the cumulative likelihood defined by the standard Kalman filter. In practice, however, there is a major difficulty in performing such an optimal search: since the number of possible trajectories grows exponentially as new frames are acquired, the required computational power soon becomes astronomical. We propose a sub-optimal search algorithm that trades optimality for computational efficiency. The standard Kalman filter is extended to accommodate multiple hypotheses.
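The innovation-based cost can be illustrated with a standard constant-velocity Kalman filter: each hypothesis in the multi-hypothesis search accumulates this cost over frames, and the branches with the lowest cumulative cost are kept. The noise levels below are illustrative assumptions, not tuned values:

```python
import numpy as np

# Constant-velocity model in the image plane at 25 Hz; the noise
# levels are illustrative, not the values used in the project.
DT = 1.0 / 25.0
F = np.array([[1, 0, DT, 0], [0, 1, 0, DT],
              [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
Q = 1e-2 * np.eye(4)   # process noise
R = 1.0 * np.eye(2)    # measurement noise

def kalman_step(x, P, z):
    """One predict/update cycle; returns the new state, covariance,
    and the innovation cost (negative log-likelihood) of measurement z."""
    x = F @ x                            # predict state
    P = F @ P @ F.T + Q                  # predict covariance
    nu = z - H @ x                       # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    cost = 0.5 * (nu @ np.linalg.solve(S, nu)
                  + np.log(np.linalg.det(2 * np.pi * S)))
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ nu
    P = (np.eye(4) - K @ H) @ P
    return x, P, cost
```

A candidate that fits the predicted motion yields a small cost; an outlier candidate yields a large one, so its branch is eventually pruned.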
Making multiple hypotheses
We also developed a tennis ball tracking algorithm based on a particle filter. By separating object detection from object tracking, samples can be drawn directly from the posterior density, which improves sampling efficiency. By incorporating the tennis player tracking result, the tracker switches automatically between two dynamic models according to the distance between a particle and the tennis players, which increases robustness against abrupt motion changes. The resulting trajectory is sufficiently accurate for key event detection.
Particle filter based ball tracker
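The model-switching idea can be sketched as follows, with a state of (x, y, vx, vy) per particle: near a player the ball may be hit, so a high-noise dynamic model is applied, while elsewhere a low-noise, near-ballistic model is used. The switching distance and noise magnitudes below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate(particles, player_pos, switch_dist=50.0):
    """Propagate (x, y, vx, vy) particles one frame, switching the
    process noise depending on the distance to the nearest player.
    switch_dist and the noise levels are assumed values."""
    pos, vel = particles[:, :2], particles[:, 2:]
    near = np.linalg.norm(pos - player_pos, axis=1) < switch_dist
    noise = np.where(near[:, None], 20.0, 2.0)   # high noise near a player
    vel = vel + noise * rng.standard_normal(vel.shape)
    return np.hstack([pos + vel, vel])

def resample(particles, weights):
    """Multinomial resampling proportional to the particle weights."""
    w = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]
```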
Tennis player tracking
A relatively simple method for tracking the players on a tennis court, albeit quite efficient in this context, is to detect the players in the first frame at any plausible court position (preferably close to the court lines, as they always stand there at the beginning of play) and to then only allow them to move within a small distance from one frame to the next. In addition, we adopt an adaptive colour-based particle filter to track the tennis players. Foreground moving objects are extracted using background subtraction, and the tracker is initialised by detecting smoothly moving foreground objects. After initialisation, a colour histogram of the pixels inside the player's bounding box is constructed; this histogram then serves as a template. The Bhattacharyya distance between the template and each particle is used to weight the particle, and the histogram template is updated online to handle appearance drift.
Tracking of tennis players
Feature point tracking
Another component of visual tracking is the feature point tracker. It is used for real-time self-localisation, augmented reality, and mosaicing, and it supports the object tracker described above. The feature point tracker is based on the Shi-Tomasi-Kanade tracker. In order to meet the requirements with respect to computational efficiency and robustness, this approach was enhanced in several key areas.
The application of a linear illumination model reduced the sensitivity of the feature point tracker to illumination changes. This is a very critical point, as a change of the angle between the object's surface and the light source can lead to large changes in intensity. The auto exposure of the camera also leads to fluctuations in brightness.
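A linear illumination model of this kind typically fits a gain and a bias that map the current patch intensities onto the stored template, so that the residual used for tracking becomes insensitive to global brightness changes. A least-squares sketch of that fit (an illustration, not the tracker's exact formulation):

```python
import numpy as np

def fit_illumination(template, patch):
    """Least-squares fit of gain g and bias b so that
    g * patch + b best matches the template patch."""
    A = np.stack([patch.ravel(), np.ones(patch.size)], axis=1)
    (g, b), *_ = np.linalg.lstsq(A, template.ravel(), rcond=None)
    return g, b
```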
Several improvements were developed to reduce the computation time. One of them is an efficient hierarchical search for new features at run-time. In addition, for ego-motion estimation, the traditional gradient descent algorithm for translation estimation was replaced with a block matching algorithm, which requires much lower overhead per frame and therefore allows much higher frame rates. We reach a frame rate of about 160 frames per second for 30 features on a personal computer with a 2.4 GHz Intel P4 CPU.
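The kind of translation estimator that replaced the gradient descent step can be sketched as an exhaustive sum-of-absolute-differences (SAD) search over a small window; block size and search radius below are illustrative:

```python
import numpy as np

def block_match(prev, curr, top, left, size=8, radius=4):
    """Find the integer translation of a size x size block from `prev`
    (at row `top`, col `left`) within +-radius pixels in `curr`,
    minimising the sum of absolute differences (SAD)."""
    block = prev[top:top + size, left:left + size].astype(float)
    best, best_sad = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > curr.shape[0] or x + size > curr.shape[1]:
                continue   # candidate block leaves the image
            sad = np.abs(curr[y:y + size, x:x + size] - block).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best
```

Because each candidate offset costs only additions and comparisons on a small block, this search is cheap enough to run for dozens of features per frame.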