VAMPIRE: Visual Active Memory Processes and Interactive REtrieval

Object Recognition and Learning

In the VAMPIRE scenarios, object recognition plays a crucial role for nearly all further processing. Objects are manipulated, they define contexts, and their positions are memorised. Robust recognition and classification of objects is therefore a prerequisite for our cognitive vision system. Although object recognition is needed in all VAMPIRE scenarios, the challenges differ. In the Scenario "Mobile Augmented Reality", several different objects can occur in the scene, and there is little or no prior knowledge about them. The object recognition system therefore has to be quickly (re-)trainable online and, since the system has to be reactive, must also recognise objects in close to real time. In the Scenario "Video Annotation", object recognition can follow more model-based approaches, but the results have to be highly accurate for reliable video annotation.

Online Object Recognition and Learning

The object recognition subsystem, which is part of the VAMPIRE augmented reality system, is based on a two-step procedure of pre-segmentation and classification, described in the following:

An Attentional Subsystem for Pre-Segmentation

Pre-segmentation is based on a data-driven calculation of different saliency measures such as local entropy, symmetry, and Harris edge-corner detection. Each method produces a saliency map as output. These maps are integrated into a single Attention Map by a weighted summation over the input maps. The Attention Map is then used to determine candidate image regions that might contain objects. The following picture shows an example of an Attention Map together with the extracted regions and their centres of mass, called Focus Points (FPs).

The extracted Focus Points are input to the neural classification subsystem, which crops windows from the input image around the FP locations and applies a neural classifier called VPL to them to derive object class labels. This is described next.


An example of an Attention Map used to derive candidate object locations
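The weighted map combination and the Focus Point extraction can be summarised in a short sketch. The following Python fragment is illustrative only, assuming NumPy/SciPy: the saliency operators (entropy, symmetry, Harris) are passed in as functions, and the per-map normalisation and the relative threshold are assumptions rather than the system's actual parameters.

```python
import numpy as np
from scipy import ndimage

def attention_map(image, saliency_fns, weights):
    """Combine individual saliency maps into one Attention Map."""
    maps = [fn(image) for fn in saliency_fns]
    # Normalise each map to [0, 1] before the weighted summation.
    maps = [(m - m.min()) / (np.ptp(m) + 1e-9) for m in maps]
    return sum(w * m for w, m in zip(weights, maps))

def focus_points(att, rel_thresh=0.5):
    """Threshold the Attention Map and return each region's centre of mass."""
    mask = att > rel_thresh * att.max()
    labels, n = ndimage.label(mask)
    return ndimage.center_of_mass(att, labels, range(1, n + 1))
```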

Neural Classification

The applied neural classifier is called "VPL" and consists of a multi-stage architecture based on Vector Quantisation (VQ) to partition the input space, Local Principal Component Analysis (LPCA) for feature extraction, and Local Linear Maps (LLM) for classification. The first two stages, VQ and LPCA, are trained unsupervised, whereas the final classification stage, LLM, is trained supervised using class label information. The following image shows a sketch of the VPL architecture.

Architecture of the neural classifier (VPL)
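A minimal Python sketch of the three-stage idea follows, using scikit-learn for the VQ and PCA stages; the local linear maps are approximated here by per-cell least-squares maps onto one-hot class targets. The prototype and component counts and the LLM training rule are assumptions, not the project's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

class VPLSketch:
    """Three-stage VPL-style classifier: VQ -> local PCA -> local linear maps."""

    def __init__(self, n_prototypes=5, n_components=5):
        self.vq = KMeans(n_clusters=n_prototypes, n_init=10)
        self.n_components = n_components

    def fit(self, X, y):
        cells = self.vq.fit_predict(X)          # stage 1: unsupervised VQ
        self.n_classes = int(y.max()) + 1
        self.pcas, self.maps = [], []
        for k in range(self.vq.n_clusters):
            Xk, yk = X[cells == k], y[cells == k]
            # Stage 2: unsupervised local PCA per VQ cell.
            pca = PCA(n_components=min(self.n_components, *Xk.shape))
            Fk = pca.fit_transform(Xk)
            # Stage 3: supervised local linear map onto one-hot class targets.
            Fk1 = np.hstack([Fk, np.ones((len(Fk), 1))])   # bias column
            T = np.eye(self.n_classes)[yk]
            W, *_ = np.linalg.lstsq(Fk1, T, rcond=None)
            self.pcas.append(pca)
            self.maps.append(W)
        return self

    def predict(self, X):
        cells = self.vq.predict(X)
        out = np.empty(len(X), dtype=int)
        for i in range(len(X)):
            f = self.pcas[cells[i]].transform(X[i:i + 1])
            f1 = np.hstack([f, np.ones((1, 1))])
            out[i] = int(np.argmax(f1 @ self.maps[cells[i]]))
        return out
```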

Online Learning

The neural classifier can be trained online using example views of objects that are collected through interaction strategies. The user selects a small number of appropriate views of the object to be trained. This view database is additionally expanded by translating and scaling the image patches, which extends the training set for better classification performance. The database grows during system operation. As soon as a re-adaptation of the object recognition subsystem is initiated, the classifier is retrained online to recognise new examples and object classes that have been added to the view database. The three-stage architecture of the classifier allows for two different training modes:

  • Fast Learning: In this mode, only the final classification layer (LLM) is retrained, while the first two stages remain unchanged. This process is very fast, and a new classifier can be loaded into the system almost immediately after the retraining procedure is initiated. However, the classification accuracy obtained by "Fast Learning" is not optimal.
  • Full Training: At the same time, a background process computes a new, fully trained classifier that also adapts the first two stages. This process is more time-consuming, but the classification accuracy is much better than in the "Fast Learning" mode (a code sketch of both modes follows this list).
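Building on the VPLSketch class above, the two modes could look roughly as follows; the function names are illustrative, not the project's actual API.

```python
def fast_learning(vpl, X, y):
    """Fast Learning: retrain only the LLM stage; VQ and LPCA stay fixed."""
    cells = vpl.vq.predict(X)
    vpl.n_classes = max(vpl.n_classes, int(y.max()) + 1)
    for k in range(vpl.vq.n_clusters):
        Xk, yk = X[cells == k], y[cells == k]
        if len(Xk) == 0:
            continue                     # no new examples fell in this cell
        Fk1 = np.hstack([vpl.pcas[k].transform(Xk), np.ones((len(Xk), 1))])
        T = np.eye(vpl.n_classes)[yk]
        vpl.maps[k], *_ = np.linalg.lstsq(Fk1, T, rcond=None)
    return vpl

def full_training(X, y):
    """Full Training: rebuild all three stages in a background process."""
    return VPLSketch().fit(X, y)
```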
The following image shows a visualisation of the internal representation of the VQ and LPCA stages after training to recognise 5 objects, with 10 views of each object supplied in the view database.

Visualisation of the internal representation of the VPL classifier with 5 prototype vectors used in the VQ step and 5 Local PCs. The top row depicts the 5 prototypes of the first stage (Vector quantisation) and each column below shows the corresponding local principal components.

Object localisation in cluttered scenes

Locating objects in a cluttered scene can be viewed as the task of matching an object model to the scene, which is known to be very difficult. The difficulties arise because an object's appearance changes with viewing distance and angle. The conventional way to tackle such changes is to split the object into several small parts whose appearance changes only slightly, or in a predictable manner, with distance and viewing angle, allowing a search for matching pairs of parts between model and scene. However, these small parts alone are not as distinctive as the whole object, so matching produces many confusing pairs. Using the spatial location and connectivity information of the parts allows false pairs to be removed from the final result.

In our approach, we split an object into parts while preserving their spatial location and connectivity by creating an Attributed Relational Graph (ARG) whose nodes represent the parts of the object and whose edges represent their connectivity. The same algorithm is used to create the ARG of the scene to be analysed. Each node and edge of an ARG has an associated feature vector, which allows checking how well a model node fits a scene node and how well a model edge fits a scene edge.

A typical object is represented by a graph with 10 to 100 nodes, while the graph of even a moderately cluttered scene has more than 1000 nodes. The matching algorithm therefore cannot be built as a simple exhaustive search, because the number of possible assignments is huge. We use a relaxation labelling approach to establish the matching (ref to PRL paper).
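A minimal sketch of such a relaxation labelling update is given below, assuming precomputed node affinities and edge compatibilities; the update follows the classic Rosenfeld-Hummel-Zucker scheme and is not necessarily the exact variant of the cited paper.

```python
import numpy as np

def relaxation_labelling(node_affinity, edge_compat, n_iter=50):
    """node_affinity: (M, S) initial fit of each model node to each scene node.
    edge_compat: (M, S, M, S) support that assignment (j -> b) lends to
    assignment (i -> a), derived from comparing edge feature vectors."""
    P = node_affinity / node_affinity.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Support for each assignment from the current assignments of all
        # other model nodes.
        Q = np.einsum('iajb,jb->ia', edge_compat, P)
        P = P * (1.0 + Q)
        P /= P.sum(axis=1, keepdims=True)   # renormalise per model node
    return P.argmax(axis=1)     # most plausible scene node per model node
```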


Example of object localisation

Tennis court recognition

The first step in detecting and interpreting the activity in a given scene is to establish the presence of the scene itself and to allow all subsequently tracked events to be registered into a model scene. In the case of tennis, we therefore need to identify the tennis court in the scene and project everything that happens in it onto a model court. The only reliable features of a tennis court are its lines and the presence of the net in the middle; these are the features used to identify the court in mosaic images.

Example of court recognition and registration to the model court
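A rough sketch of the two steps, line detection and registration to the model court, using OpenCV; the Hough parameters and the corner-ordering convention are illustrative assumptions.

```python
import cv2
import numpy as np

# Doubles-court outline in metres (standard dimensions, used as the model).
MODEL_CORNERS = np.float32([[0, 0], [10.97, 0], [10.97, 23.77], [0, 23.77]])

def detect_court_lines(frame):
    """Pick up the long, straight white court lines with a Hough transform."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 80, 160)
    return cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                           minLineLength=100, maxLineGap=10)

def register_court(image_corners):
    """image_corners: the four court corners found from line intersections,
    ordered consistently with MODEL_CORNERS."""
    H, _ = cv2.findHomography(np.float32(image_corners), MODEL_CORNERS)
    return H    # maps image coordinates onto the model court
```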

Foreground blob recognition in tennis ball tracking

In tennis ball tracking, foreground blobs are obtained by differencing temporally close frames. A foreground blob may be the true tennis ball, but it may also be part of the player, the racket, or even part of the advertising boards, due to various inaccuracies. A method for foreground blob recognition is therefore required. It has been suggested in the literature that the tennis ball's standardised yellow colour can be used for recognition. However, in some off-air material the colour bandwidth is low and the tennis ball is very small, so its colour can be strongly affected by the colour of the background it is travelling on. Moreover, tennis video archives are usually subject to artifacts introduced by analogue encoding, for instance the PAL cross-colour effect, which appears as reddish noise "floating" in the image. This also suggests that colour information by itself is not sufficient for recognition.
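For illustration, the blob extraction step by temporal differencing might look as follows; the double-difference scheme and the threshold are assumptions, not necessarily the exact procedure used.

```python
import cv2

def foreground_blobs(prev, curr, nxt, thresh=15):
    """Extract foreground blobs from three temporally close frames."""
    d1 = cv2.absdiff(curr, prev)
    d2 = cv2.absdiff(nxt, curr)
    motion = cv2.bitwise_and(d1, d2)       # moving in both frame differences
    _, mask = cv2.threshold(cv2.cvtColor(motion, cv2.COLOR_BGR2GRAY),
                            thresh, 255, cv2.THRESH_BINARY)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    return stats[1:], centroids[1:]        # skip the background component
```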

We suggest a multi-cue foreground blob recognition method that is robust against chroma noise. Each foreground blob is enlarged and interpolated, an ellipse is fitted to its edge pixels, and points are sampled along the ellipse. For each sample point, the surface normal direction is found and the Sobel gradient is calculated. Under the assumption that a target-originated blob is a local maximum in the intensity image and is approximately elliptical, alpha, the mean absolute angle difference between the normal directions and the gradient directions over all sample points, can be used as an important cue for blob recognition (a sketch follows below).
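A sketch of the alpha cue under these assumptions: the ellipse parameters come from the fit described above, the sampling density and Sobel kernel size are illustrative, and the absolute dot product folds the gradient/normal sign ambiguity into the range [0, pi/2]. For a ball-like intensity blob, boundary gradients point roughly along the outward normal, so alpha stays small.

```python
import cv2
import numpy as np

def alpha_cue(patch, cx, cy, a, b, theta, n_samples=32):
    """Mean absolute angle between ellipse normals and image gradients
    over points sampled on the fitted ellipse (patch is grayscale)."""
    gx = cv2.Sobel(patch, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(patch, cv2.CV_64F, 0, 1, ksize=3)
    c, s = np.cos(theta), np.sin(theta)
    diffs = []
    for t in np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False):
        # Point on the fitted ellipse (rotated by theta, centred at cx, cy).
        px = cx + a * np.cos(t) * c - b * np.sin(t) * s
        py = cy + a * np.cos(t) * s + b * np.sin(t) * c
        # Outward normal of the ellipse at this point, rotated into the image.
        n = np.array([b * np.cos(t) * c - a * np.sin(t) * s,
                      b * np.cos(t) * s + a * np.sin(t) * c])
        xi = int(np.clip(np.round(px), 0, patch.shape[1] - 1))
        yi = int(np.clip(np.round(py), 0, patch.shape[0] - 1))
        g = np.array([gx[yi, xi], gy[yi, xi]])
        cos_ang = abs(g @ n) / (np.linalg.norm(g) * np.linalg.norm(n) + 1e-9)
        diffs.append(np.arccos(np.clip(cos_ang, 0.0, 1.0)))
    return float(np.mean(diffs))    # small alpha suggests a ball-like blob
```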

An 8-dimensional feature vector is constructed, with alpha as one of its dimensions. The other dimensions are the coordinates of the blob centre (row and column), the parameters of the fitted ellipse (major and minor axis), and the means of the pixels inside the blob in the HSV channels. An SVM classifier is then trained to classify foreground blobs into tennis balls and non-balls.
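A corresponding sketch of the feature construction and classifier; the feature ordering and the SVM kernel choice are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def blob_feature_vector(alpha, row, col, major, minor, hsv_patch, mask):
    """8-D descriptor: alpha, blob centre, ellipse axes, mean HSV inside blob."""
    mean_hsv = hsv_patch[mask > 0].mean(axis=0)     # (H, S, V) means
    return np.concatenate([[alpha, row, col, major, minor], mean_hsv])

# Trained on labelled examples, e.g. clf.fit(features, labels) with labels
# in {ball, non-ball}; an RBF kernel is assumed here.
clf = SVC(kernel='rbf', gamma='scale')
```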


Examples of blob recognition. Left: an object-originated blob. Right: a clutter-originated blob.

Player pose

In the context of tennis matches, the player's pose can in most cases be a very important indication of activity; being able to determine the pose accurately is therefore of great help in discovering and exploiting context in such sequences. A specific case where a more elaborate analysis has been made is the detection of whether a player is serving (see Actionrecognition:ServeDetection), which shows how much important contextual information can be revealed by studying the players' poses.


Example of a player contour when serve detection suggests that a serve has occurred

Selected Publications