VAMPIRE: Visual Active Memory Processes and Interactive REtrieval
 
 
Conference Papers
 
 

Bajramovic, F.; Graeßl, Ch. & Denzler, J.
Efficient Combination of Histograms for Real-Time Tracking Using Mean-Shift and Trust-Region Optimization
Proc. Pattern Recognition Symposium (DAGM), Springer, 2005.
Abstract: Histogram based real-time object tracking methods, like the Mean-Shift tracker of Comaniciu/Meer or the Trust-Region tracker of Liu/Chen, have been presented recently. The main advantage is that a suited histogram allows for very fast and accurate tracking of a moving object even in the case of partial occlusions and for a moving camera. The problem is which histogram shall be used in which situation. In this paper we extend the framework of histogram based tracking. As a consequence we are able to formulate a tracker that uses a weighted combination of histograms of different features. We compare our approach with two already proposed histogram based trackers for different histograms on large test sequences available to the public. The algorithms run in real-time on standard PC hardware.
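A minimal sketch of the similarity machinery such histogram trackers share: normalized histograms compared via the Bhattacharyya coefficient, combined as a weighted sum over several feature channels. This is an illustration under our own naming, not the authors' code; it assumes grayscale patches with pixel values in [0, 1).

```python
import numpy as np

def histogram(patch, bins=16):
    """Normalized intensity histogram of a patch (pixel values in [0, 1))."""
    h, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    return h / max(h.sum(), 1)

def bhattacharyya(p, q):
    """Similarity of two normalized histograms; 1.0 means identical."""
    return float(np.sum(np.sqrt(p * q)))

def combined_similarity(target_patches, candidate_patches, weights):
    """Weighted combination of per-feature histogram similarities,
    e.g. one histogram per channel (intensity, hue, edge strength)."""
    sims = [bhattacharyya(histogram(t), histogram(c))
            for t, c in zip(target_patches, candidate_patches)]
    w = np.asarray(weights, float)
    return float(np.dot(w / w.sum(), sims))
```

A mean-shift or trust-region tracker would maximize such a combined similarity over candidate positions in each frame.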

Bax, I.; Heidemann, G. & Ritter, H.
A Hierarchical Feed-forward Network for Object Detection Tasks
Proc. SPIE Conf. on Independent Component Analyses, Wavelets, Unsupervised Smart Sensors, Neural Networks 5818:144-152, 2005.
Abstract: Recent research on Neocognitron-like neural feed-forward architectures, which have formerly been successfully applied to recognition of artificial stimuli like paperclip objects, shows promise for application to more natural stimuli. Several authors have shown high recognition performance of such networks with respect to translation, rotation, scaling and cluttered surroundings. In this contribution, we introduce a variation of existing hierarchical models that is trained using a non-negative matrix factorization algorithm. In contrast to previous work, our approach can not only classify objects but is also capable of rapid object detection in natural scenes. Thus, the time-consuming and conceptually unsatisfying split-up into a localization stage (e.g. using segmentation) and a subsequent classification can be avoided. Though in principle an exhaustive search by classification of every sub-window of an image is performed, the process is nevertheless highly efficient. The network consists of alternating layers of simple and complex cell planes and incorporates nonlinear processing schemes that have been proposed in recent literature. Learning of receptive field profiles for the lower layers of the network takes place by unsupervised learning whereas a final classification layer is trained supervised. Detection is achieved by attaching an additional network layer, whose simple cell profiles are learned from the final classification units that were acquired during the training phase. We test the classification performance of the network on images of natural objects which are systematically distorted. To test the ability to detect objects, cluttered natural background is used.

Bax, I.; Heidemann, G. & Ritter, H.
Using Non-negative Sparse Profiles in a Hierarchical Feature Extraction Network
Proc. Machine Vision Applications, 2005.
Abstract: In this contribution we utilize recent advances in feature coding strategies for a hierarchical Neocognitron-like neural architecture, which can be used for invariant recognition of natural visual stimuli like objects or faces. Several researchers have identified that sparseness is an important coding principle for learning receptive field profiles that resemble response properties of simple cells in visual cortex. However, an ongoing discussion is concerned with the question whether sparseness should be imposed on the latent variables -- as implicitly done in ICA or Sparse Coding -- or if it should rather be imposed directly on the feature matrix. Since answers to this question have so far not been unique and were rather qualitative in nature, this paper investigates the two possibilities by applying a recently introduced algorithm for Non-negative Matrix Factorization with Sparseness Constraints (NMFSC) to feature learning in a hierarchical recognition network. For this network, we compare recognition performance on several difficult image datasets under varying sparseness settings.
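For readers unfamiliar with NMF, the basic factorization underlying NMFSC looks like this. The sketch shows only the plain Lee/Seung multiplicative updates; the explicit sparseness-projection step of Hoyer's NMFSC algorithm, which the paper actually uses, is omitted here.

```python
import numpy as np

def nmf(V, r, iters=200, seed=0):
    """Plain NMF via multiplicative updates: V ≈ W @ H with W, H >= 0.
    V must be non-negative; r is the number of basis components."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-4
    H = rng.random((r, m)) + 1e-4
    for _ in range(iters):
        # Each update multiplies by a non-negative ratio, so W, H stay >= 0.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

In the paper's setting, the columns of W play the role of learned receptive field profiles, and sparseness constraints are imposed either on W or on H.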

Bax, I.; Heidemann, G. & Ritter, H.
Face Detection and Identification Using a Hierarchical Feed-forward Recognition Architecture
Proc. Intl Joint Conf. on Neural Networks, 2005.
Abstract: We apply a hierarchical feed-forward neural architecture to the problem of face recognition. The network is similar to the Neocognitron approach and a two-layer variation of this architecture, which has previously been successfully applied to patch classification tasks. We extend this architecture to a three-layer one, which allows not only identification of image patches, but also detection in larger images. In the research area of face recognition a lot of expertise has been developed for the problem of either identification or detection, but approaches which deal with both problems simultaneously are rare. In this work, we apply the hierarchical approach to this problem and evaluate the performance on artificial datasets.

Bekel, H.; Heidemann, G. & Ritter, H.
SOM Based Image Data Structuring in an Augmented Reality Scenario
Proc. Intl Joint Conf. on Neural Networks, 2005.
Abstract: Our research focuses on the development of a mobile Augmented Reality system which is capable of acquiring image data in an unrestricted environment and which provides a comfortable facility to label this data. To structure the image data, modified MPEG-7 features are computed, and by means of Self-Organizing Maps (SOMs) the imagery can be labeled stepwise. First, the complete data set is projected onto the SOM using a combination of color and edge features. In a second step, selected parts of the imagery are retrained, weighting the feature blocks depending on characteristics of the acquired image data. Within a few steps the partitioning leads to SOM nodes on which the projected imagery can be labeled as objects or rejected.

Christmas, W.; Kostin, A.; Yan, F.; Kolonias, I. & Kittler, J.
A system for the automatic annotation of tennis matches
Fourth International Workshop on Content-Based Multimedia Indexing, 2005.
Abstract: In this paper we describe a system for the automatic annotation of tennis matches. The goal is to provide annotation at all levels, from shot detection to a complete breakdown of the scoring within the match. At present the system will automatically analyse a tennis video to the extent that it can identify the outcome of individual video shots, with reasonable accuracy. We briefly describe the overall system architecture, and describe in more detail the key components: the ball tracking and the high-level reasoning.

Deutsch, B.; Graessl, Ch.; Bajramovic, F. & Denzler, J.
A Comparative Evaluation of Template and Histogram Based 2-d Tracking Algorithms
Proc. Pattern Recognition Symposium (DAGM), Springer, 2005.
Abstract: In this paper, we compare and evaluate five contemporary, data-driven, real-time 2D object tracking methods: the region tracker by Hager et al., the Hyperplane tracker, the CONDENSATION tracker, and the Mean Shift and Trust Region trackers. The first two are classical template based methods, while the latter three are from the more recently proposed class of histogram based trackers. All trackers are evaluated for the task of pure translation tracking, as well as tracking translation plus scaling. For the evaluation, we use a publicly available, labeled data set consisting of surveillance videos of humans in public spaces. This data set demonstrates occlusions, changes in object appearance, and scaling.

Fritsch, J.; Kleinehagenbrock, M.; Haasch, A.; Wrede, S. & Sagerer, G.
A Flexible Infrastructure for the Development of a Robot Companion with Extensible HRI-Capabilities
Proc. IEEE Int. Conf. on Robotics and Automation 3419-3425, 2005.
Abstract: The development of robot companions with natural human-robot interaction (HRI) capabilities is a challenging task as it requires incorporating various functionalities. Consequently, a flexible infrastructure for controlling module operation and data exchange between modules is proposed, taking into account insights from software system integration. This is achieved by combining a three-layer control architecture containing a flexible control component with a powerful communication framework. The use of XML throughout the whole infrastructure facilitates ongoing evolutionary development of the robot companion's capabilities.

Gräßl, C.; Zinßer, T. & Niemann, H.
3-D Object Tracking with the Adaptive Hyperplane Approach Using SIFT Models for Initialization
Conference on Machine Vision Applications 5-8, 2005.
Abstract: Object tracking is still a challenging task, especially if it is done in a realistic environment. The ongoing increase of computational power and the efficiency of the algorithms allow real-time estimation of the object's pose in six degrees of freedom. One of these algorithms is the 3-D hyperplane approach, which is used throughout this paper, as it has been proven to be fast and accurate. We show how to enhance its robustness by using a linear illumination model to gain more insensitivity to variations of the illumination conditions. We also present an adaptation to compensate for appearance changes in case of external rotations. Although some six degrees of freedom trackers have been established, the necessary initialization is often ignored or is only solved rudimentarily. In contrast to this, we show how to use a 3-D SIFT object model for initialization of the whole tracking system and prove its efficiency by experimental results using real image sequences.

Hanheide, M. and Bauckhage, C. and Sagerer, G.
Combining Environmental Cues & Head Gestures to Interact with Wearable Devices
International Conference on Multimodal Interfaces, 2005.
Abstract: As wearable sensors and computing hardware are becoming a reality, new and unorthodox approaches to seamless human-computer interaction can be explored. This paper presents the prototype of a wearable, head-mounted device for advanced human-machine interaction that integrates speech recognition and computer vision with head gesture analysis based on inertial sensor data. We will focus on the innovative idea of integrating visual and inertial data processing for interaction. Fusing head gestures with results from visual analysis of the environment provides rich vocabularies for human-machine communication because it renders the environment into an interface: if objects or items in the surroundings are being associated with system activities, head gestures can trigger commands if the corresponding object is being looked at. We will explain the algorithmic approaches applied in our prototype and present experiments that highlight its potential for assistive technology. Apart from pointing out a new direction for seamless interaction in general, our approach provides a new and easy to use interface for disabled and paralyzed users in particular.

Kittler, J.; Christmas, W.; Kostin, A.; Yan, F.; Kolonias, I. & Windridge, D.
A memory architecture and contextual reasoning framework for cognitive vision
14th Scandinavian Conference on Image Analysis, 2005.
Abstract: One of the key requirements for a cognitive vision system to support reasoning is the possession of an effective mechanism to exploit context both for scene interpretation and for action planning. Context can be used effectively provided the system is endowed with a conducive memory architecture that supports contextual reasoning at all levels of processing, as well as a contextual reasoning framework. In this paper we describe a unified apparatus for reasoning using context, cast in a Bayesian reasoning framework. We also describe a modular memory architecture developed as part of the VAMPIRE vision system which allows the system to store raw video data at the lowest level and its semantic annotation of monotonically increasing abstraction at the higher levels. By way of illustration, we use as an application for the memory system the automatic annotation of a tennis match.

Lang, P. & Pinz, A.
Calibration of Hybrid Vision / Inertial Tracking Systems
Integration of Vision and Inertial Sensors, 2005.
Abstract: Within a hybrid vision / inertial tracking system proper calibration of the sensors and their relative pose is essential. We present a new method for 3-axis inertial sensor calibration based on model fitting and a method to find the rotation between vision and inertial system based on rotation differences. We achieve a coordinate system rotation mismatch of < 1° with respect to mechanical setup and sensor performance.

Lang, P.; Stock, C. & Pinz, A.
Sensor Fusion for 6 Degrees of Freedom Subwindow Tracking
submitted to British Machine Vision Conference, 2005.
Abstract: This work describes a method of fusing different sensor information to estimate the pose of a camera. Exploiting the benefits of two almost complementary sensor types (vision and inertial), a reliable estimation of all 6 degrees of freedom of a camera can be achieved. The resulting pose stream exhibits less jitter and allows faster movements of the camera (especially rotation) than the purely vision-based tracking system. After fully automatic initialization, natural landmarks are used for pose estimation. Experimental results show that the proposed system achieves the required properties.

Lütkebohle, I.; Wrede, S. & Wachsmuth, S.
Unsupervised Filtering of XML Streams for System Integration
International Workshop on Pattern Recognition in Information Systems, 2005.
Abstract: In recent years, computer vision research has been shifting more and more from algorithmic solutions to the construction of active systems. One novel approach to system construction combines data- and event-driven architectures, concentrating on the flow of information between components. A challenge in data-driven architectures is to optimize communications behavior without changing component implementations. For example, in computer vision, a common problem is that low-level components produce many very similar results whereas on a higher level, only significant changes are of interest. This distinction can be defined as a pattern recognition task that analyzes the data flow in the system. In the following, we will first give a short introduction into the architecture, then describe a generic solution for data-flow reduction based on XML distance metrics. We present first results on the application of this component in an integration framework for a vision-based Human-Computer-Interface within an augmented reality scenario.

Siegl, H.; Hanheide, M.; Wrede, S. & Pinz, A.
AR Human Computer Interface for Object Localization in a Cognitive Vision System
submitted to Image and Vision Computing (Special Issue on HCI), 2005.

Siegl, H.; Hanheide, M.; Wrede, S. & Pinz, A.
Augmented Reality as Human Computer Interface in a Cognitive Vision System
submitted to ISMAR, 2005.

Siegl, H. & Pinz, A.
A Stereoscopic AR-based Object Localization Tool for a Cognitive Vision System
Joint Hungarian-Austrian Conference on Image Processing and Pattern Recognition 11-16, 2005.

Stock, C. & Pinz, A.
Vision-based Tracking-framework using Natural Landmarks for Indoor Augmented Reality
submitted to ISMAR, 2005.
Abstract: This work presents a real-time tracking approach, which is able to use natural landmarks for camera pose estimation. The system works fully automatically; even initialization is done without any user interaction. Only one artificial landmark is used for automatic initialization. Natural landmarks are used by projection of their known position in 3D onto the image plane of the camera. These projected areas can be used for 2D feature tracking. The accuracy of the used 3D features must be better than 3.5 mm, otherwise the system fails. Experimental results in a real office environment show that the approach is valid and satisfies the required properties of augmented reality applications. The mean jitter of the system is smaller than 3 cm in position and 1° in orientation. Update-rates of more than 100 fps of the resulting pose-stream can be achieved.

Stock, C. & Pinz, A.
Tracking of Natural Landmarks for Augmented-Reality Purpose
Joint Hungarian-Austrian Conference on Image Processing and Pattern Recognition 343-349, 2005.

Yan, F.; Christmas, W. & Kittler, J.
A Tennis Ball Tracking Algorithm for Automatic Annotation of Tennis Match
The British Machine Vision Conference, to appear, 2005.
Abstract: Several tennis ball tracking algorithms have been reported in the literature. However, most of them use high quality video and multiple cameras, and the emphasis has been on coordinating the cameras, or visualising the tracking results. In this paper, we propose a tennis ball tracking algorithm for low quality off-air video recorded with a single camera. Multiple visual cues are exploited for tennis candidate detection. A particle filter with improved sampling efficiency is used to track the tennis candidates. Experimental results show that our algorithm is robust and has a tracking accuracy that is sufficiently high for automatic annotation of tennis matches.
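The tracking step can be illustrated with a minimal bootstrap particle filter for a 2D position. This is a simplified stand-in of our own devising; the paper's filter with improved sampling efficiency operates on tennis-ball candidates rather than a single noiseless measurement.

```python
import numpy as np

def pf_step(particles, measurement, motion_std=2.0, meas_std=3.0, rng=None):
    """One predict/update/resample cycle of a bootstrap particle filter
    tracking a 2D position. `particles` is an (N, 2) array."""
    rng = rng or np.random.default_rng()
    # Predict: diffuse particles with Gaussian motion noise.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: weight by Gaussian likelihood of the observed candidate.
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / meas_std ** 2)
    w /= w.sum()
    # Resample (multinomial; systematic resampling is the usual refinement
    # for better sampling efficiency).
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]
```

Run over a frame sequence, the particle cloud concentrates around the ball position; the posterior mean serves as the track estimate.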

Zinßer, T.; Schmidt, J. & Niemann, H.
Point Set Registration with Integrated Scale Estimation
Int. Conf. on Pattern Recognition and Information Processing, 2005.
Abstract: We present an iterative registration algorithm for aligning two differently scaled 3-D point sets. It extends the popular Iterative Closest Point (ICP) algorithm by estimating a scale factor between the two point sets in every iteration. The presented algorithm is especially useful for the registration of point sets generated by structure-from-motion algorithms, which only reconstruct the 3-D structure of a scene up to scale. Like the original ICP algorithm, the presented algorithm requires a rough pre-alignment of the point sets. In order to determine the necessary accuracy of the pre-alignment, we have experimentally evaluated the basin of convergence of the algorithm with respect to the initial rotation, translation, and scale factor between the two point sets.
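The per-iteration estimation step that distinguishes this algorithm from rigid ICP, solving for a similarity transform instead of a rigid one, has a closed form. The sketch below is an Umeyama-style solution for already-corresponded points; the nearest-neighbour matching and the outer ICP iteration are omitted, and the function name is ours.

```python
import numpy as np

def align_with_scale(P, Q):
    """Closed-form scale s, rotation R, translation t minimizing
    ||s R p_i + t - q_i|| over corresponded 3-D point sets P, Q
    (rows are points)."""
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mp, Q - mq
    U, S, Vt = np.linalg.svd(Qc.T @ Pc)
    # Guard against reflections.
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / np.sum(Pc ** 2)
    t = mq - s * R @ mp
    return s, R, t
```

Within ICP, this replaces the rigid alignment of each iteration, so the scale factor is re-estimated alongside rotation and translation.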

Bauckhage, C.; Hanheide, M.; Wrede, S. & Sagerer, G.
A Cognitive Vision System for Action Recognition in Office Environments
Proc. of IEEE Conf. on Computer Vision and Pattern Recognition 2:827-833, 2004.
Abstract: The emerging cognitive vision paradigm is concerned with vision systems that evaluate, gather and integrate contextual knowledge for visual analysis. In reasoning about events and structures, cognitive vision systems should rely on multiple computations in order to perform robustly even in noisy domains. Action recognition in an unconstrained office environment thus provides an excellent testbed for research on cognitive computer vision. In this contribution, we present a system that consists of several computational modules for object and action recognition. It applies attention mechanisms, visual learning and contextual as well as probabilistic reasoning to fuse individual results and verify their consistency. Database technologies are used for information storage and an XML based communication framework integrates all modules into a consistent architecture.

Bekel, H.; Bax, I.; Heidemann, G. & Ritter, H.
Adaptive Computer Vision: Online Learning for Object Recognition
Rasmussen, C.E.; Bülthoff, H.H.; Schölkopf, B. & Giese, M.A. (ed.) Proc. Pattern Recognition Symposium (DAGM), Springer 3175:447-454, 2004.
Abstract: The "life" of most neural vision systems splits into a one-time training phase and an application phase during which knowledge is no longer acquired. This is both technically inflexible and cognitively unsatisfying. Here we propose an appearance based vision system for object recognition which can be adapted online, both to acquire visual knowledge about new objects and to correct erroneous classification. The system works in an office scenario, acquisition of object knowledge is triggered by hand gestures. The neural classifier offers two ways of training: Firstly, the new samples can be added immediately to the classifier to obtain a running system at once, though at the cost of reduced classification performance. Secondly, a parallel processing branch adapts the classification system thoroughly to the enlarged image domain and loads the new classifier to the running system when ready.

Chen, J. & Pinz, A.
Structure and motion by fusion of inertial and vision-based tracking
Burger, W. & J.Scharinger (ed.) Proc. of the 28th. ÖAGM/AAPR Conference on Digital Imaging in Media and Education, OCG 179:55-62, 2004.
Abstract: We present a new structure and motion framework for real-time tracking applications combining inertial sensors with a camera. Our method starts from an initial estimation of the state vector, which is then used for the structure from motion algorithm. The algorithm can simultaneously determine the position of the sensors, as well as estimate the structure of the scene. An extended Kalman filter is used to estimate motion by fusion of inertial and vision data. It includes two independent measurement channels for the low frequency vision-based measurements and for the high frequency inertial measurements, respectively. A bank of Kalman filters is designed to estimate the 3D structure of the real scene by using the result of motion estimation. These two parts work alternately. Our experimental results show good convergence of the estimated scene structure to ground truth. Potential applications are in mobile augmented reality and in mobile robotics.

Chen, J. & Pinz, A.
Simultaneous tracking and modeling by fusion of inertial and vision sensors
Proc. of CAPTECH 2004 26-31, 2004.
Abstract: Visual tracking applications are often facing situations with partially unknown or temporally changing environment. Existing vision based structure and motion algorithms are too fragile and tend to drift. This paper investigates how the fusion of inertial and vision data can be used to gain robustness. Fusion is based on Kalman filtering, using an Extended Kalman filter to fuse inertial and vision data, and a bank of Kalman filters to estimate the sparse 3D structure of the real scene. A simple, known target is used for initial pose estimation. Subsequently, motion and structure estimation filters work alternately to recover the sensor motion, scene structure and other parameters. We analyze the uncertainty distribution of reconstructed feature points, and show that inertial data can be used to improve position accuracy of reconstructed features and motion estimation. The performance of this algorithm has been tested on synthetic data and on real image sequences. Experimental results show the efficiency of additional inertial information for improved accuracy of pose estimation and the reduction of drift.
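The two-channel idea, high-rate inertial data driving the prediction and low-rate vision data correcting it, can be reduced to a one-dimensional toy filter. This is our simplification for illustration only; the papers above use an Extended Kalman Filter in 6 DoF plus a bank of structure filters.

```python
import numpy as np

def fuse_step(x, P, dt, accel, pos=None, q=0.5, r_pos=0.01):
    """One cycle of a 1-D constant-velocity Kalman filter. State x = [p, v].
    The inertial channel (acceleration) drives the prediction every cycle;
    the vision channel (position) corrects only when a measurement arrives."""
    F = np.array([[1.0, dt], [0.0, 1.0]])
    B = np.array([0.5 * dt * dt, dt])
    Q = q * np.array([[dt**4 / 4, dt**3 / 2], [dt**3 / 2, dt**2]])
    # Predict with the inertial measurement as control input.
    x = F @ x + B * accel
    P = F @ P @ F.T + Q
    # Correct with the vision measurement, if one arrived this cycle.
    if pos is not None:
        H = np.array([[1.0, 0.0]])
        S = H @ P @ H.T + r_pos
        K = (P @ H.T) / S
        x = x + (K * (pos - x[0])).ravel()
        P = (np.eye(2) - K @ H) @ P
    return x, P
```

Running the prediction at inertial rate and the correction at video rate keeps the pose stream smooth between vision updates while bounding drift.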

Cheng, F.; Christmas, W. & Kittler, J.
Periodic Human Motion Description for Sports Video Databases
Int. Conf. on Pattern Recognition 3:870-873, 2004.
Abstract: Many different visual features can be used for analysis and annotation of sports video material. Here we present a periodic motion feature descriptor that can discriminate between different sports types that contain periodic motion. The experimental results, using video material from the 1992 Barcelona Olympic Games, show that the proposed periodic motion descriptor can successfully classify four sports types: sprint, long-distance running, hurdling and canoeing.

Deutsch, B.; Scholz, I.; Graeßl, C. & Niemann, H.
Extending Light Fields using Object Tracking Techniques
Vision, Modeling, and Visualization 109-116, 2004.
Abstract: We present two new approaches to extending existing light fields with additional image data. In this case a light field is initially constructed from an image sequence taken by a hand-held camera, and pose parameters of this camera obtained through structure-from-motion approaches. To extend such a light field, point correspondences are necessary from one image in the original sequence to the new images to estimate their relative poses. The two introduced approaches assist in finding the original image closest to the new image, and provide initial motion estimates. A SIFT feature based method is used to determine the closest image and an image-space motion homography. The second approach uses images rendered from the light field to estimate the camera pose of the image to be added using adaptive random search or a particle filter.

Gorges, N.; Hanheide, M.; Christmas, W.; Bauckhage, C.; Sagerer, G. & Kittler, J.
Mosaics from Arbitrary Stereo Video Sequences
Rasmussen, C.E.; Bülthoff, H.H.; Giese, M.A. & Schölkopf, B. (ed.) Proc. Pattern Recognition Symposium (DAGM), Springer-Verlag 3175:342-349, 2004.
Abstract: Although mosaics are well established as a compact and non-redundant representation of image sequences, their application still suffers from restrictions of the camera motion or has to deal with parallax errors. We present an approach that allows construction of mosaics from arbitrary motion of a head-mounted camera pair. As there are no parallax errors when creating mosaics from planar objects, our approach first decomposes the scene into planar sub-scenes from stereo vision and creates a mosaic for each plane individually. The power of the presented mosaicing technique is evaluated in an office scenario, including the analysis of the parallax error.

Gräßl, C.; Zinßer, T. & Niemann, H.
A Probabilistic Model-Based Template Matching Approach for Robust Object Tracking in Real-Time
Girod, B.; Magnor, M. & Seidel, H. (ed.) Vision, Modeling, and Visualization, Aka / IOS Press, Berlin, Amsterdam 81-88, 2004.
Abstract: In recent years, template matching approaches for object tracking in real-time have become more and more popular, mainly due to the increase in available computational power and the advent of very efficient algorithms. Particularly, data-driven methods based on first order approximations have shown very promising results. If the object to be tracked is known, a model-based tracking algorithm is preferable, because available knowledge of the appearance of the object from different views can be used to improve the robustness of the tracking. In this paper, we enhance the well-known hyperplane tracker with a probabilistic tracking framework using the CONDENSATION algorithm, which is noted for its robustness and efficiency. Furthermore, we put forward a subspace method for improving the tracker's robustness against illumination variations. We prove the efficiency of our proposed methods with experiments on video sequences of real scenes.

Gräßl, C.; Zinßer, T. & Niemann, H.
Efficient Hyperplane Tracking by Intelligent Region Selection
Proc. IEEE Southwest Symposium on Image Analysis and Interpretation 51-55, 2004.
Abstract: The main aim of this work is to improve the accuracy of Jurie's hyperplane tracker for real-time template matching. As the computation time of the initialization of the algorithm depends on the number of points used for estimating the motion of the template, only a subset of points in the tracked template is considered. Traditionally, this subset is determined at random. We present three different methods for selecting points better suited for the hyperplane tracker. We also propose to incorporate color information by working with eigenintensities instead of gray-level intensities, which can greatly improve the estimation accuracy, but only entails a slight increase in computation time. We have carefully evaluated the performance of the proposed methods in experiments with real image sequences.
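The hyperplane tracker's training idea, learning a linear map from intensity differences to motion parameters by applying known perturbations to the template, can be sketched as follows. This is an illustrative reduction to pure translation with our own function names; the paper's contribution (selecting which template points enter the map) is not shown.

```python
import numpy as np

def train_hyperplane(sample, max_shift=3):
    """Learn the linear map A with dmu ≈ A @ di: known translations dmu are
    applied to the template, the resulting intensity differences di are
    collected, and A is solved by least squares. `sample(dx, dy)` must
    return the flattened template pixels at offset (dx, dy)."""
    ref = sample(0, 0)
    M, D = [], []
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            M.append([dx, dy])
            D.append(sample(dx, dy) - ref)
    M, D = np.asarray(M, float), np.asarray(D, float)
    A = np.linalg.lstsq(D, M, rcond=None)[0].T  # solves D @ A.T ≈ M
    return A, ref

def track_step(A, ref, sample, x, y):
    """One tracking update: predict the offset from the intensity
    difference at the current estimate and move the estimate back by it."""
    dmu = A @ (sample(x, y) - ref)
    return x - dmu[0], y - dmu[1]
```

Because A is precomputed, each tracking step costs only one matrix-vector product, which is what makes the approach real-time capable.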

Hanheide, M.; Bauckhage, C. & Sagerer, G.
Memory Consistency Validation in a Cognitive Vision System
Int. Conf. on Pattern Recognition, IEEE 459-462, 2004.
Abstract: Ensuring the consistency of memory content is a key feature of cognitive vision systems. This paper presents an approach to deal with functional dependencies of hypotheses stored in a visual active memory. By means of Bayesian networks a probabilistic approach is used to incorporate uncertainty of observations. Furthermore, a measure to detect inconsistencies in the memory is introduced. The benefit of this validation module as part of an integrated system is shown for the task of visual surveillance in an office scenario.

Heidemann, G.; Bax, I.; Bekel, H.; Bauckhage, C.; Wachsmuth, S.; Fink, G.; Pinz, A.; Ritter, H. & Sagerer, G.
Multimodal interaction in an augmented reality scenario
Int. Conf. on Multimodal Interfaces 53-60, 2004.
Abstract: We describe an augmented reality system designed for online acquisition of visual knowledge and retrieval of memorized objects. The system relies on a head mounted camera and display, which allow the user to view the environment together with overlaid augmentations by the system. In this setup, communication by hand gestures and speech is mandatory as common input devices like mouse and keyboard are not available. Using gesture and speech, basically three types of tasks must be handled: (i) Communication with the system about the environment, in particular, directing attention towards objects and commanding the memorization of sample views; (ii) control of system operation, e.g. switching between display modes; and (iii) re-adaptation of the interface itself in case communication becomes unreliable due to changes in external factors, such as illumination conditions. We present an architecture to manage these tasks and describe and evaluate several of its key elements, including modules for pointing gesture recognition, menu control based on gesture and speech, and control strategies to cope with situations when vision becomes unreliable and has to be re-adapted by speech.

Heidemann, G.; Bekel, H.; Bax, I. & Ritter, H.
Interactive Online Learning
Proc. Int. Conf. on Pattern Recognition and Image Analysis, St. Petersburg Electrotechnical University 1:44-48, 2004.
Abstract: We present a computer vision system for object recognition which is integrated in an augmented reality setup. The system can be trained online to the recognition of objects in an intuitive way. The augmented reality gear allows interaction using hand gestures for the control of displayed "virtual menus". The underlying neural recognition system combines feature extraction and classification. Its three-stage architecture facilitates fast adaptation: In a fast training mode (FT), only the last stage is adapted, whereas complete training (CT) re-builds the system from scratch. Using FT, online acquired views can be added at once to the classifier, the system being operational after a delay of less than a second, though still with reduced classification performance. In parallel, a new classifier is trained (CT) and loaded to the system when ready.

Heidemann, G.; Bekel, H.; Bax, I. & Saalbach, A.
Hand Gesture Recognition: Self-Organising Maps as a Graphical User Interface for the Partitioning of Large Training Data Sets
Kittler, J.; Petrou, M. & Nixon, M. (ed.) Int. Conf. on Pattern Recognition, IEEE CS-Press 4:487-490, 2004.
Abstract: Gesture recognition is a difficult task in computer vision due to the numerous degrees of freedom of a human hand. Fortunately, human gesture covers only a small part of the theoretical configuration space of a hand, so an appearance based representation of human gesture becomes tractable. A major problem, however, is the acquisition of appropriate labelled image data from which an appearance based representation can be built. In this paper we apply self-organising maps for a visualisation of large amounts of segmented hands performing pointing gestures. Using a graphical interface, an easy labelling of the data set is facilitated. The labelled set is used to train a neural classification system, which is itself embedded in a larger architecture for the recognition of gestural reference to objects.
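A minimal SOM of the kind used for such visualisation and labelling can be sketched as below. This is our illustrative reduction to 2-D data; the paper operates on high-dimensional appearance features of segmented hand images.

```python
import numpy as np

def train_som(data, grid=(6, 6), iters=1000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal Self-Organising Map: each grid node holds a prototype
    vector; the best-matching unit and its grid neighbours are pulled
    towards each presented sample, with shrinking rate and radius."""
    rng = np.random.default_rng(seed)
    gy, gx = grid
    coords = np.array([(i, j) for i in range(gy) for j in range(gx)], float)
    W = rng.random((gy * gx, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        frac = t / iters
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 0.5
        bmu = np.argmin(np.sum((W - x) ** 2, axis=1))
        d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))  # neighbourhood on the grid
        W += lr * h[:, None] * (x - W)
    return W, coords

def project(W, x):
    """Index of the best-matching node for a sample (used for labelling)."""
    return int(np.argmin(np.sum((W - x) ** 2, axis=1)))
```

Projecting the whole data set through `project` groups similar samples on neighbouring nodes, which is what makes node-wise labelling of large sets feasible.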

Jaser, E.; Christmas, W. & Kittler, J.
Hierarchical Decision Making Scheme for Sports Video Categorisation with Temporal Post-Processing
Proc. of IEEE Conf. on Computer Vision and Pattern Recognition 2:908-913, 2004.
Abstract: The problem of automatic sports video classification is considered. We develop a multistage decision making system that is founded on the concept of cues, i.e. pieces of visual evidence, characteristic of certain categories of sports that are extracted from key frames. The main decision making mechanism is a decision tree which generates hypotheses concerning the semantics of the sports video content. The final stage of the decision making process is a Hidden Markov Model system which bridges the gap between the semantic content categorisation defined by the user and the actual visual content categories. The latter is often ambiguous, as the same visual content may be attributed to different sport categories, depending on the context. We demonstrate experimentally that the contextual post-processing of the decision tree outputs by HMMs significantly improves the performance of the sports video classification system.

Jaser, E.; Christmas, W. & Kittler, J.
Temporal Post-Processing of Decision Tree Outputs for Sports Video Categorisation
Fred, A.; Caelli, T.; Duin, R.P.W.; Campilho, A. & de Ridder, D. (ed.) Joint IAPR Int. Workshop on Syntactical and Structural Pattern Recognition, Springer-Verlag GmbH 3138:495, 2004.
Abstract: In this paper, we describe a multistage decision making system to deal with the problem of automatic sports video classification. The system is founded on the concept of cues, i.e. pieces of visual evidence, characteristic of certain categories of sports that are extracted from key frames. The main decision making mechanism is a decision tree which generates hypotheses concerning the semantics of the sports video content. The final stage of the decision making process is a Hidden Markov Model system which bridges the gap between the semantic content categorisation defined by the user and the actual visual content categories. The latter is often ambiguous, as the same visual content may be attributed to different sport categories, depending on the context. We tested the system using two setups of HMMs. In the first, we construct and train an HMM model for each sport. A post-processing step is needed in this setup to combine the outcomes of the individual HMMs. In the second setup, we eliminate the need for post-processing by constructing a single HMM with each node representing one of the sports we want to detect. Comparing the results obtained from both setups showed that a single HMM delivered the better performance.

Kittler, J. & Sadeghi, M.
Physics-based decorrelation of image data for decision level fusion in face verification
Roli, F.; Kittler, J. & Windeatt, T. (ed.) Proc. Int. Workshop Multiple Classifier Systems, Springer-Verlag GmbH 3077:354-363, 2004.
Abstract: We consider the problem of face verification using multichannel image data where each channel serves as the input to a separate face verification expert. By decorrelating the information content of the respective data channels, we enhance the diversity of the resulting face verification experts as well as the performance of the multiple classifier system.

Kittler, J. & Sadeghi, M.
Approximate Gradient Direction Metric for Face Authentication
Fred, A.; Caelli, T.; Duin, R.P.W.; Campilho, A. & de Ridder, D. (ed.) Joint IAPR Int. Workshop on Syntactical and Structural Pattern Recognition, Springer-Verlag GmbH 3138:797, 2004.
Abstract: In pattern recognition problems where the decision making is based on a measure of similarity, the choice of an appropriate distance metric significantly influences the performance and speed of the decision making process. We develop a novel metric which is an approximation of the successful Gradient Direction (GD) metric. The proposed metric is evaluated on a face authentication problem using the BANCA database. It outperforms the standard benchmark, the normalised correlation. Although it is not as powerful as the GD metric, it is ten times faster.

Kolonias, I.; Christmas, W. & Kittler, J.
Automatic Evolution Tracking for Tennis Matches Using an HMM-Based Architecture
Proc. IEEE Machine Learning for Signal Processing Workshop 615-624, 2004.
Abstract: Creating a cognitive vision system which will infer high-level semantic information from low-level feature and event information for a given type of multimedia content is a problem attracting many researchers' attention in recent years. In this work, we address the problem of automatic interpretation and evolution tracking of a tennis match using standard broadcast video sequences as input data. The use of a hierarchical structure consisting of Hidden Markov Models is proposed. This will take low-level events as its input and produce an output where the final state will indicate if the point is to be awarded to one player or another. Using ground-truth data as input for the classifier described, the points are always correctly awarded to the players. Even when the ground-truth data was modified by randomly inserting errors and used as input for the proposed system, the system performance degraded gracefully.

Kolonias, I.; Christmas, W. & Kittler, J.
Tracking the Evolution of a Tennis Match using Hidden Markov Models
Fred, A.; Caelli, T.; Duin, R.P.W.; Campilho, A. & de Ridder, D. (ed.) Joint IAPR Int. Workshop on Syntactical and Structural Pattern Recognition, Springer-Verlag 3138:1078-1086, 2004.
Abstract: The creation of a cognitive perception system capable of inferring higher-level semantic information from low-level feature and event information for a given type of multimedia content is a problem that has attracted many researchers' attention in recent years. In this work, we address the problem of automatic interpretation and evolution tracking of a tennis match using standard broadcast video sequences as input data. The use of a hierarchical structure consisting of Hidden Markov Models is proposed. This will take low-level events as its input, and will produce an output where the final state will indicate if the point is to be awarded to one player or another. Using hand-annotated data as input for the classifier described, we have witnessed 100% of the points being correctly awarded to the players.

Kolonias, I.; Christmas, W. & Kittler, J.
Use of context in automatic annotation of sports videos
Progress in Pattern Recognition, Image Analysis and Applications: Iberoamerican Congress on Pattern Recognition 3287:1-12, 2004.
Abstract: Creating a cognitive vision system which will infer high-level semantic information from low-level feature and event information for a given type of multimedia content is a problem attracting many researchers' attention in recent years. In this work, we address the problem of automatic interpretation and evolution tracking of a tennis match using standard broadcast video sequences as input data. The use of a hierarchical structure consisting of Hidden Markov Models is proposed. This will take low-level events as its input and produce an output where the final state will indicate if the point is to be awarded to one player or another. Using ground-truth data as input for the classifier described, the points are always correctly awarded to the players. Even when the ground-truth data was modified by randomly inserting errors and used as input for the proposed system, the system performance degraded gracefully.

Rohlfing, T.; Denzler, J.; Russakoff, D.; Gräßl, C. & Maurer, C.
Markerless Real-Time Target Region Tracking: Application to Frameless Stereotactic Radiosurgery
Girod, B.; Magnor, M. & Seidel, H.-P. (ed.) Int. Workshop on Vision, Modeling, and Visualization, Aka / IOS Press 5-12, 2004.
Abstract: Accurate and fast registration of intra-operative 2D projection images to 3D pre-operative images is an important component of many image-guided surgical procedures. If the 2D image acquisition is repeated several times during the procedure, the registration problem can be cast instead as a 3D tracking problem. To solve the 3D problem, we propose in this paper to apply a real-time 2D region tracking algorithm to first recover the components of the transformation that are in-plane to the projections. From the 2D motion estimates of all projections, a consistent estimate of the 3D motion is derived. We compare this method to computation in 3D and a combination of both. Using clinical data with a gold-standard transformation, we show that a standard tracking algorithm is capable of accurately and robustly tracking regions in x-ray projection images, and that the use of 2D tracking greatly improves the accuracy and speed of 3D tracking.

Sadeghi, M. & Kittler, J.
A Comparative Study of Data Fusion Strategies in Face Verification
Proc. European Signal Processing Conference, to appear, 2004.
Abstract: In this paper, the merits of fusing colour information in a face verification system are studied. Three different levels of fusion, namely, signal, feature and decision levels are considered. The study is performed on a fisherface-based (LDA) verification system considering the Gradient Direction metric as the scoring function. We show that almost all the fusion methods enhance the performance of the system. However, despite the common use of fusion at the signal level realised by creating intensity images, the other fusion methods, especially the decision-level fusion using score averaging, are more effective.

Sadeghi, M. & Kittler, J.
Data Fusion in Face Verification
Proc. Second COST 275 Workshop, Biometrics on the Internet: Fundamentals, Advances and Applications 63-68, 2004.
Abstract: In this paper, the merits of fusing colour information in a face verification system are studied. Three different levels of fusion, namely, signal, feature and decision levels are considered. The study is performed on a fisherface-based (LDA) verification system considering the Gradient Direction metric as the scoring function. We show that almost all the fusion methods enhance the performance of the system. However, despite the common use of fusion at the signal level realised by creating intensity images, the other fusion methods, especially the decision-level fusion using score averaging, are more effective.

Sadeghi, M. & Kittler, J.
Decision Making in the LDA Space: Generalised Gradient Direction Metric
Proc. Int. Conf. on Automatic Face and Gesture Recognition 248-253, 2004.
Abstract: We consider the problem of face authentication in the Linear Discriminant Analysis (LDA) space and investigate the effect of different scoring functions on the performance of the authentication system. First the theory of optimal metric for measuring the similarity between a pair of face images presented in [On matching scores for lda-based face verification] is extended to cope with general class specific covariance structures. The resulting gradient metric is experimentally compared with the commonly used normalised correlation and the original gradient metric. The merit of global and client specific thresholding is also investigated. The study is performed on the BANCA database [The BANCA database and evaluation protocol] using internationally agreed experimental protocols. The results suggest that the novel metric is superior in scenarios where the quality of input face data is comparable to the quality of data used for determining the LDA space. In other cases, the weaker model deploying the isotropic covariance matrix in working out the gradient direction is preferable.

Siegl, H. & Pinz, A.
A Mobile AR kit as a Human Computer Interface for Cognitive Vision
Int. Workshop on Image Analysis for Multimedia Interactive Services , 2004.
Abstract: Existing Augmented Reality (AR) applications suffer from restricted mobility and insufficient tracking (head-pose calculation) capabilities to be used in fully mobile, potentially outdoor applications. We present a new AR-kit, which has been designed for modular and flexible use in mobile, stationary, in- and outdoor situations. The system is wearable and consists of two independent subsystems, one for video augmentation and 3D visualization, the other one for real-time tracking fusing vision-based and inertial tracking components. Several AR-kits can be operated simultaneously, communicating via wireless LAN, thus enabling in- and outdoor applications of mobile multiuser AR scenarios. In the European cognitive vision project VAMPIRE (IST-2001-34401), our AR-kits are used for interactive teaching of visual active memory. This is achieved via a new kind of 3D augmented pointing, which combines inside-out tracking and 3D stereo HCI, and delivers approximate scene coordinates and extent of real objects in the scene.

Siegl, H.; Schweighofer, G. & Pinz, A.
An AR Human Computer Interface for Object Localization in a Cognitive Vision Framework
Sebe, N.; Lew, M.S. & Huang, T.S. (ed.) Int. Workshop on Computer Vision in Human-Computer Interaction, ECCV 2004, Springer 3058:176-186, 2004.
Abstract: In the European cognitive vision project VAMPIRE (IST-2001-34401), mobile AR-kits are used for interactive teaching of a visual active memory. This is achieved by 3D augmented pointing, which combines inside-out tracking for head pose recovery and 3D stereo HCI in an office environment. An artificial landmark is used to establish a global coordinate system, and a sparse reconstruction of the office provides natural landmarks (corners). This paper describes the basic idea of the 3D cursor. In addition to the mobile system, at least one camera is used to obtain different views of an object, which could be employed to improve e.g. view-based object recognition. Accuracy of the 3D cursor for pointing in a scene coordinate system is evaluated experimentally.

Stock, C.
Real-time, purely vision-based tracking-framework using natural landmarks
TU Graz, 2004.

Stock, C.
Real-Time Tracking Avoids the Correspondence Problem in Spatio-Temporal Analysis
BMVA Symp. "Spatio Temporal Image Processing", 2004.
Abstract: There are many well-known applications of spatio-temporal analysis, which require known landmarks (fiducials, control points, natural landmarks) in the scene, e.g., camera pose estimation [Lu], object tracking [Kato], and wide baseline stereo [Baumberg]. Typically we face a high initial complexity of the model-to-image or model-to-scene correspondence problem [Brandner]. Imagine for example the initialisation of a tracking process based on the matching of a simple scene model (e.g. 5 control points) and a feature set (e.g. several hundred points extracted from the first image of the sequence). This problem may be alleviated by extracting additional descriptive features (colour, cornerness, texture, etc., see [Stock]), but still persists. Recent progress in real-time tracking and online structure and motion analysis has led us to a new approach, which is based on frame-to-frame point correspondences and on emerging structure estimation, and completely avoids the hard matching problem during initialisation or re-initialisation. This opens up a new field of online spatio-temporal reasoning which may be successfully applied in several high-potential applications.

Stock, C.; Lambrecht, M.; Opelt, A. & Pinz, A.
Object centered feature selection for weakly-unsupervised object categorization
Workshop of the Austrian Association for Pattern Recognition (ÖAGM/AAPR), OCG 179:79-86, 2004.
Abstract: We describe a novel approach of spatio-temporal mapping of local image features, to reduce the number of input data for further object categorization. The main focus of our work is the selection of good features to learn, by achieving a precise mapping of image features either related to static objects or to background. This can be done by initial camera motion estimation, subsequent structure estimation and final clustering of the 3D points. Experimental results show that our method achieves a significant reduction of processed image features, which yields a better performance in subsequent learning modules.

Wrede, S.; Fritsch, J.; Bauckhage, C. & Sagerer, G.
An XML Based Framework for Cognitive Vision Architectures
Int. Conf. on Pattern Recognition 757-760, 2004.
Abstract: Distributed processing and memory structures are very important aspects of cognitive vision systems. Both issues not only require sophisticated conceptual designs but also pose problems of software and systems engineering. In this paper, we describe a general XML based solution to these problems. Practical experiences are reported to underline its suitability.

Wrede, S.; Hanheide, M.; Bauckhage, C. & Sagerer, G.
An Active Memory as a Model for Information Fusion
Int. Conf. on Information Fusion 198-205, 2004.
Abstract: Information fusion is a mandatory prerequisite for cognitive vision systems. These are vision systems that apply reasoning and learning on different levels of abstraction and correspondingly have to deal with hypotheses from different categorical domains. Following some principles of human cognition, we present an approach to information fusion that closely couples reasoning and representation. We will discuss how processes like probabilistic contextual reasoning as well as functional and non-functional requirements in storing data from different sources can be integrated by a unified XML based data representation. Due to the interaction between active processes and data storage, we call our approach an active memory. Performance results of an implemented system as well as an evaluation of data fusion from contextual inference will be presented.

Wrede, S.; Ponweiser, W.; Bauckhage, C.; Sagerer, G. & Vincze, M.
Integration Frameworks for Large Scale Cognitive Vision Systems - An Evaluative Study
Int. Conf. on Pattern Recognition 761-764, 2004.
Abstract: Owing to the ever growing complexity of present day computer vision systems, system architecture has become an emerging topic in vision research. Systems that integrate numerous modules and algorithms of different I/O and time scale behavior require sound and reliable concepts for interprocess communication. Consequently, topics and methods known from software and systems engineering are becoming increasingly important. Especially framework technologies for system integration are required. This contribution results from a cooperation between two multinational projects on cognitive vision. It discusses functional and non-functional requirements in cognitive vision and compares and assesses existing solutions.

Zinßer, T.; Gräßl, C. & Niemann, H.
Efficient Feature Tracking for Long Video Sequences
Rasmussen, C.E.; Bülthoff, H.H.; Giese, M.A. & Schölkopf, B. (ed.) Pattern Recognition, 26th DAGM Symposium, Springer-Verlag, Berlin, Heidelberg, New York 326-333, 2004.
Abstract: This work is concerned with real-time feature tracking for long video sequences. In order to achieve efficient and robust tracking, we propose two interrelated enhancements to the well-known Shi-Tomasi-Kanade tracker. Our first contribution is the integration of a linear illumination compensation method into the inverse compositional approach for affine motion estimation. The resulting algorithm combines the strengths of both components and achieves strong robustness and high efficiency at the same time. Our second enhancement copes with the feature drift problem, which is of special concern in long video sequences. Refining the initial frame-to-frame estimate of the feature position, our approach relies on the ability to robustly estimate the affine motion of every feature in every frame in real-time. We demonstrate the performance of our enhancements with experiments on real video sequences.

Ahmadyfard, A. & Kittler, J.
A Multiple Classifier System Approach to Affine Invariant Object Recognition
Int. Conf. on Computer Vision Systems 2626:438-447, 2003.
Abstract: We propose an affine invariant object recognition system which is based on the principle of multiple classifier fusion. Accordingly, two recognition experts are developed and used in tandem. The first expert performs a coarse grouping of the object hypotheses based on an entropy criterion. This initial classification is performed using colour cues. The second expert establishes the object identity by considering only the subset of candidate models contained in the most probable coarse group. This expert takes into account geometric relations between object primitives and determines the winning hypothesis by means of relaxation labelling. We demonstrate the effectiveness of the proposed object recognition strategy on the Surrey Object Image Library database. The experimental results not only show improved recognition performance but also a computational speed up.

Bauckhage, C.; Käster, T.; Pfeiffer, M. & Sagerer, G.
Content-Based Image Retrieval by Multimodal Interaction
Annual Conference of the IEEE Industrial Electronics Society (IECON) 1865-1870, 2003.
Abstract: Due to the size of today's professional image databases, the standard approach to content-based image retrieval is to interactively navigate through the content. However, most people whose job necessitates working with such databases do not have a technical background. Commercial practice thus requires efficient retrieval techniques as well as navigation interfaces that are intuitive to use and easy to learn. In this paper we introduce a system for interactive image retrieval that combines different approaches to feature based queries. Furthermore, it allows multimodal interaction because apart from conventional input devices like mouse and keyboard, it is possible to operate the system using a touch screen or even natural language. Besides technical details and results on retrieval accuracy, we will also present results of usability experiments which underline that users appreciate multimodal interfaces for image retrieval.

Bax, I.; Bekel, H. & Heidemann, G.
Recognition of Gestural Object Reference with Auditory Feedback
Int. Conf. on Artificial Neural Networks 2714:425-432, 2003.
Abstract: We present a cognitively motivated vision architecture for the evaluation of pointing gestures. The system views a scene of several structured objects and a pointing human hand. A neural classifier gives an estimation of the pointing direction, then the object correspondence is established using a sub-symbolic representation of both the scene and the pointing direction. The system achieves high robustness because the result (the indicated location) does not primarily depend on the accuracy of the pointing direction classification. Instead, the scene is analysed for low level saliency features to restrict the set of all possible pointing locations to a subset of highly likely locations. This transformation of the "continuous" to a "discrete" pointing problem simultaneously facilitates an auditory feedback whenever the object reference changes, which leads to a significantly improved human-machine interaction.

Chandraker, M.; Stock, C. & Pinz, A.
Real Time Camera Pose in a Room
Int. Conf. on Computer Vision Systems 2626:98-110, 2003.
Abstract: Many applications of computer vision require camera pose in real-time. We present a new, fully mobile, purely vision-based tracking system that works indoors in a prepared room, using artificial landmarks. The main contributions of the paper are: improved pose accuracy by subpixel corner localization, high frame rates by CMOS image acquisition of small subwindows, and a novel sparse 3D model of the room for a spatial target representation and selection scheme which gains robustness.

Christmas, W.; Jaser, E.; Messer, K. & Kittler, J.
A Multimedia System Architecture for Automatic Annotation of Sports Videos
Int. Conf. on Computer Vision Systems 2626:513-522, 2003.
Abstract: ASSAVID is an EU-sponsored project which is concerned with the development of a system for the automatic segmentation and semantic annotation of sports video material. In this paper we describe the architecture for a system that automatically creates high-level textual annotation for this material, to create a fully automatic sports video logging process. The proposed technique relies upon the concept of "cues" which attach semantic meaning to low-level features computed on the video and audio. Experimental results on sports video provided by the BBC demonstrate that this method is working well. The system merges and synchronises several streams of cues derived from the video and audio sources, where each stream may have a different latency.

Chum, O.; Matas, J. & Kittler, J.
Locally optimized RANSAC
Proc. Pattern Recognition Symposium (DAGM) 2781:236-243, 2003.
Abstract: A new enhancement of RANSAC, the locally optimized RANSAC (LO-RANSAC), is introduced. It has been observed that, to find an optimal solution (with a given probability), the number of samples drawn in RANSAC is significantly higher than predicted from the mathematical model. This is due to the incorrect assumption that a model with parameters computed from an outlier-free sample is consistent with all inliers. The assumption rarely holds in practice. The locally optimized RANSAC makes no new assumptions about the data; on the contrary, it makes the above-mentioned assumption valid by applying local optimization to the solution estimated from the random sample. The performance of the improved RANSAC is evaluated in a number of epipolar geometry and homography estimation experiments. Compared with standard RANSAC, the speed-up achieved is two- to three-fold and the quality of the solution (measured by the number of inliers) is increased by 10-20%. The number of samples drawn is in good agreement with theoretical predictions.

Gräßl, C.; Deinzer, F. & Niemann, H.
Continuous Parametrization of Normal Distributions for Improving the Discrete Statistical Eigenspace Approach for Object Recognition
Int. Conf. on Pattern Recognition and Information Processing 1:73-77, 2003.
Abstract: Statistical approaches play an important role in computer vision; normal distributions in particular are widely used. In this paper we present a new approach for a continuous parametrization of normal distributions. Our method is based on arbitrary interpolation techniques. This approach is used to improve the discrete statistical eigenspace approach for object recognition. The continuous parametrization of normal distributions allows an estimation of object poses where no training images were available. In an experiment with real objects we will show that our continuous approach leads to better localization and classification results than the discrete approach.

Gräßl, C.; Zinßer, T. & Niemann, H.
Illumination Insensitive Template Matching with Hyperplanes
Proc. Pattern Recognition Symposium (DAGM) 2781:273-280, 2003.
Abstract: Data-driven object tracking is very important for many vision based applications, because it does not require any previous knowledge about the object to be tracked. In the literature, template matching techniques have successfully been used to solve this task. One promising descendant of these techniques is the hyperplane approach, which is both fast and robust. Unfortunately, like other template matching algorithms, it is inherently sensitive to illumination changes. In this paper, we describe three methods that considerably improve the illumination insensitivity of the hyperplane approach, while retaining the capability of real-time tracking. Experiments conducted on real image sequences prove the efficiency of our enhancements.

Heidemann, G.; Rae, R.; Bekel, H.; Bax, I. & Ritter, H.
Integrating Context-Free and Context-Dependent Attentional Mechanisms for Gestural Object Reference
Int. Conf. on Computer Vision Systems 2626:22-33, 2003.
Abstract: We present a vision system for human-machine interaction based on a small wearable camera mounted on glasses. The camera views the area in front of the user, especially the hands. To evaluate hand movements for pointing gestures and to recognise object references, an approach to integrating bottom-up generated feature maps and top-down propagated recognition results is introduced. Modules for context-free focus of attention work in parallel with the hand gesture recognition. In contrast to other approaches, the fusion of the two branches is on the sub-symbolic level. This method facilitates both the integration of different modalities and the generation of auditory feedback.

Jaser, E.; Kittler, J. & Christmas, W.
Building classifier ensembles for automatic sports classification
Proc. Int. Workshop Multiple Classifier Systems 2709:366-374, 2003.
Abstract: Technology has been playing a major role in facilitating the capture, storage and communication of multimedia data, resulting in a large amount of video material being archived. To ensure its usability, the problem of automatic annotation of videos has been attracting the attention of many researchers. This paper describes one aspect of the development of a novel system which will provide a semantic annotation of sports video. The system relies upon the concept of "cues" which attach semantic meaning to low-level features computed on the video and audio. We will discuss the problem of classifying shots, based on the cues they contain, into the sports they belong to. We adopt the multiple classifier system (MCS) approach to improve classification performance. Experimental results on sports video materials provided by the BBC demonstrate the benefits of the MCS approach in relation to this difficult classification problem.

Kittler, J.; Ahmadyfard, A. & Windridge, D.
Serial Multiple Classifier Systems Exploiting a Coarse to Fine Output Coding
Proc. Int. Workshop Multiple Classifier Systems 2709:106-114, 2003.
Abstract: We investigate serial multiple classifier system architectures which exploit a hierarchical output coding. Such architectures are known to deliver performance benefits and are widely used in applications involving a large number of classes such as character and handwriting recognition. We develop a theoretical model which underpins this approach to multiple classifier system design and show how it relates to various heuristic design strategies advocated in the literature. The approach is applied to the problem of 3D object recognition in computer vision.

Käster, T.; Pfeiffer, M.; Bauckhage, C. & Sagerer, G.
Combining Speech and Haptics for Intuitive and Efficient Navigation through Image Databases
Int. Conf. on Multimodal Interfaces 180-187, 2003.
Abstract: Given the size of today's professional image databases, the standard approach to object- or theme-related image retrieval is to interactively navigate through the content. But as most users of such databases are designers or artists who do not have a technical background, navigation interfaces must be intuitive to use and easy to learn. This paper reports on efforts towards this goal. We present a system for intuitive image retrieval that features different modalities for interaction. Apart from conventional input devices like mouse or keyboard, it is also possible to use speech or haptic gesture to indicate what kind of images one is looking for. Seeing a selection of images on the screen, the user provides relevance feedback to narrow the choice of motifs presented next. This is done either by scoring whole images or by choosing certain image regions. In order to derive consistent reactions from multimodal user input, asynchronous integration of modalities and probabilistic reasoning based on Bayesian networks are applied. After addressing technical details, we will discuss a series of usability experiments, which we conducted to examine the impact of multimodal input facilities on interactive image retrieval. The results indicate that users appreciate multimodality. While we observed little decrease in task performance, measures of contentment exceeded those for conventional input devices.

Ribo, M.; Brandner, M. & Pinz, A.
A flexible software architecture for hybrid tracking
INTERVIS 3:1899-1906, 2003.
Abstract: Fusion of vision-based and inertial pose estimation has many high-potential applications in navigation, robotics, and augmented reality. Our research aims at the development of a fully mobile, completely self-contained tracking system that is able to estimate sensor motion from known 3D scene structure. This requires a highly modular and scalable software architecture for algorithm design and testing. As the main contribution of this paper, we discuss the design of our hybrid tracker and emphasize important features: scalability, code reusability, and testing facilities. In addition, we present a mobile augmented reality application, and several first experiments with a fully mobile vision-inertial sensor head. Our hybrid tracking system is not only capable of real-time performance, but can also be used for offline analysis of tracker performance, comparison with ground truth, and evaluation of several pose estimation and information fusion algorithms.

Siegl, H.; Brandner, M.; Ganster, H.; Lang, P.; Pinz, A.; Ribo, M. & Stock, C.
A Mobile Augmented Reality System
Int. Conf. on Computer Vision Systems 13-14, 2003.
Abstract: Existing augmented reality (AR) applications suffer from restricted mobility and insufficient tracking (head-pose calculation) capabilities to be used in fully mobile, potentially outdoor applications. We present a new AR-kit, which has been designed for modular and flexible use in mobile, stationary, in- and outdoor situations. The system is wearable and consists of two independent subsystems, one for video augmentation and 3D visualization, the other one for real-time tracking fusing vision-based and inertial tracking components. Several AR-kits can be operated simultaneously, communicating via wireless LAN, thus enabling in- and outdoor applications of mobile multiuser AR scenarios.

Siegl, H.; Ganster, H. & Pinz, A.
Mobile AR Setups
Workshop of the Austrian Association for Pattern Recognition (ÖAGM/AAPR) 245-252, 2003.
Abstract: Augmented reality (AR) enriches the perceived reality by additional information, with representations ranging from video annotation or highlighting to projections of complex 3D objects. This technique is used as a visual aid for medical and military purposes, for entertainment, for assembly processes, for engineering design, or in fully mobile environments like a city guide application. Concerning AR, the scientific focus of our group is real-time tracking for self-localization. In this paper we present different concepts for mobile AR systems consisting of off-the-shelf components.

Stock, C. & Pinz, A.
Similarity Measures for Corner Redetection
Proc. of Scandinavian Conf. on Image Analysis (SCIA) 2749:133-139, 2003.
Abstract: Corners are important image features for tracking applications. We present a new method to calculate the similarity of corners, which is used to improve the redetection performance of corner-based tracking applications. It is a simple and fast method to calculate a scaled measure of similarity, which aggregates basic corner features like dihedral angle, cornerness, and corner orientation. Experimental results verify that the similarity measure is well suited for tracking applications.
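An aggregation of dihedral angle, cornerness, and corner orientation into a single scaled similarity could look like the sketch below. The normalization of each feature and the equal default weighting are illustrative assumptions, not the paper's actual formula:

```python
import math

def corner_similarity(c1, c2, weights=(1.0, 1.0, 1.0)):
    """Scaled similarity in [0, 1] between two corners.

    Each corner is a dict with 'angle' (dihedral angle, radians),
    'cornerness' (detector response), and 'orientation' (radians).
    """
    w_a, w_c, w_o = weights
    # Normalized differences of the three basic corner features.
    d_angle = abs(c1['angle'] - c2['angle']) / math.pi
    d_corner = abs(c1['cornerness'] - c2['cornerness']) / max(
        c1['cornerness'], c2['cornerness'], 1e-9)
    d_orient = abs(c1['orientation'] - c2['orientation']) % (2 * math.pi)
    d_orient = min(d_orient, 2 * math.pi - d_orient) / math.pi
    total = w_a + w_c + w_o
    # 1.0 for identical corners, approaching 0.0 for dissimilar ones.
    return 1.0 - (w_a * d_angle + w_c * d_corner + w_o * d_orient) / total
```

A tracker could then, for instance, accept a redetected candidate corner only when its similarity to the tracked corner exceeds a threshold.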

Zinßer, T.; Schmidt, J. & Niemann, H.
A Refined ICP Algorithm for Robust 3-D Correspondence Estimation
Int. Conf. on Image Processing 2:695-698, 2003.
Abstract: Robust registration of two 3-D point sets is a common problem in computer vision. The iterative closest point (ICP) algorithm is undoubtedly the most popular algorithm for solving this kind of problem. In this paper, we present the Picky ICP algorithm, which has been created by merging several extensions of the standard ICP algorithm, thus improving its robustness and computation time. Using pure 3-D point sets as input data, we do not consider additional information like point color or neighborhood relations. In addition to the standard ICP algorithm and the Picky ICP algorithm proposed in this paper, a robust algorithm due to Masuda and Yokoya and the RICP algorithm by Trucco et al. are evaluated. We have experimentally determined the basin of convergence, robustness to noise and outliers, and computation time of these four ICP based algorithms.
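The standard point-to-point ICP loop that the Picky variant builds on can be sketched as follows. The brute-force matching and the optional worst-fraction rejection are illustrative simplifications, not the paper's actual extensions:

```python
import numpy as np

def icp(src, dst, iters=30, reject_fraction=0.0):
    """Minimal point-to-point ICP; `src` (N,3) and `dst` (M,3) arrays.

    Returns (R, t) such that src @ R.T + t approximates dst.
    """
    R, t = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # Brute-force nearest-neighbour correspondences.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = d2.argmin(axis=1)
        pairs_src, pairs_dst = cur, dst[nn]
        if reject_fraction > 0:  # crude stand-in for outlier rejection
            order = d2[np.arange(len(nn)), nn].argsort()
            keep = order[: int(len(order) * (1 - reject_fraction))]
            pairs_src, pairs_dst = pairs_src[keep], pairs_dst[keep]
        # Closed-form rigid alignment of the correspondences (SVD).
        mu_s, mu_d = pairs_src.mean(0), pairs_dst.mean(0)
        H = (pairs_src - mu_s).T @ (pairs_dst - mu_d)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ D @ U.T            # reflection-corrected rotation
        t_step = mu_d - R_step @ mu_s
        cur = cur @ R_step.T + t_step      # apply the incremental update
        R, t = R_step @ R, R_step @ t + t_step
    return R, t
```

The paper's evaluation criteria map directly onto this loop: the basin of convergence depends on how far apart the sets start, robustness depends on how correspondences are filtered, and computation time is dominated by the nearest-neighbour step.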

Zinßer, T.; Schmidt, J. & Niemann, H.
Performance Analysis of Nearest Neighbour Algorithms for ICP Registration of 3-D Points
Int. Workshop on Vision, Modeling, and Visualization 199-206, 2003.
Abstract: There are many nearest neighbor algorithms tailor-made for ICP, but most of them require special input data like range images or triangle meshes. We focus on efficient nearest neighbor algorithms that do not impose this limitation, and thus can also be used with 3-D point sets generated by structure-from-motion techniques. We shortly present the evaluated algorithms and introduce the modifications we made to improve their efficiency. In particular, several enhancements to the well-known k-D tree algorithm are described. The first part of our performance analysis consists of experiments on synthetic point sets, whereas the second part features experiments with the ICP algorithm on real point sets. Both parts are completed by a thorough evaluation of the obtained results.
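A plain k-D tree for nearest-neighbour queries on unorganized point sets, without any of the paper's efficiency enhancements, can be sketched as:

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Build a k-D tree by median split, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        'point': points[mid],
        'axis': axis,
        'left': build_kdtree(points[:mid], depth + 1),
        'right': build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to `query`."""
    if node is None:
        return best
    if best is None or dist2(query, node['point']) < dist2(query, best):
        best = node['point']
    diff = query[node['axis']] - node['point'][node['axis']]
    near, far = (node['left'], node['right']) if diff < 0 else \
                (node['right'], node['left'])
    best = nearest(near, query, best)
    # Descend the far subtree only if the splitting plane is closer
    # than the current best match; this pruning is what ICP exploits.
    if diff * diff < dist2(query, best):
        best = nearest(far, query, best)
    return best
```

The enhancements the paper evaluates (e.g. caching, better pruning bounds) all target the recursive search above, since it is executed once per source point in every ICP iteration.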

Ahmadyfard, A. & Kittler, J.
A comparative study of two object recognition methods
Proc. British Machine Vision Conference 363-372, 2002.
Abstract: An experimental comparative study between two representation methods for the recognition of 3D objects from a 2D view is carried out. The two methods compared are our ARG region-based representation and the elliptic region-based method of Tuytelaars et al. The results of the experiments conducted show that the former method outperforms the latter, particularly under severe scaling and also when applied to objects with curved surfaces.

Stock, C.; Mühlmann, U.; Chandraker, M. & Pinz, A.
Subpixel Corner Detection for Tracking Applications using CMOS Camera Technology
Workshop of the Austrian Association for Pattern Recognition (ÖAGM/AAPR) 191-199, 2002.
Abstract: A multistage approach to gray-level corner detection is proposed in this paper, which is based on fast corner extraction using the Plessey corner detector combined with CMOS image acquisition technologies and localization refinement using a spatial subpixel analysis approach. The proposed corner detector detects corners as the intersection points of the involved edges, using only a small neighborhood of the estimated corner position. With this approach it is also possible to compute the corner orientation and the dihedral angle of the corner. In comparison to the standard Plessey detector, which can show localization errors of several pixels, experimental results show an average error of only 0.36 pixels for our algorithm.
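The Plessey detector is better known as the Harris corner detector; its cornerness response on a single image patch can be sketched as below. The patch-level formulation and the choice k = 0.04 are common defaults rather than the paper's specific implementation, and the subpixel edge-intersection refinement is omitted:

```python
import numpy as np

def plessey_response(patch, k=0.04):
    """Harris/Plessey cornerness of a grayscale patch (2-D array).

    Builds the structure tensor from image gradients; the response is
    large and positive at corners, negative along straight edges, and
    zero in homogeneous regions.
    """
    Iy, Ix = np.gradient(patch.astype(float))
    # Entries of the structure tensor, summed over the patch window.
    Sxx, Syy, Sxy = (Ix * Ix).sum(), (Iy * Iy).sum(), (Ix * Iy).sum()
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace

# A bright quadrant forms an L-corner; a half-plane forms an edge.
corner = np.zeros((11, 11)); corner[:5, :5] = 1.0
edge = np.zeros((11, 11)); edge[:, :5] = 1.0
print(plessey_response(corner) > 0)  # True
print(plessey_response(edge) < 0)    # True
```

The coarse corner positions found this way are only pixel-accurate, which is why the paper follows them with a subpixel refinement stage that intersects the two edges meeting at the corner.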