Object Recognition as Machine Translation – Part 2: Exploiting Image Database Clustering Models

We treat object recognition as a process of attaching words to images and image regions. To accomplish this we exploit clustering methods that learn the joint statistics of words and image regions. We show how these models can then be used to attach words to images outside the training set. This “auto-annotation” process has applications such as image indexing and is closely related to object recognition. Predicted words can be compared with the actual words associated with images in a held-out set, and we introduce several performance measures based on this observation. These measures are then used to make principled comparisons of model variants and proposed enhancements. Word prediction is most simply done as a function of the entire image; for recognition, however, we need to learn the correspondence between words and specific image regions. We first show that the existing models can be used for this purpose, and then propose modifications that improve performance on this task. Finally, we propose word prediction performance as a segmentation measure and report results for two segmentation approaches.
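As a rough illustration of the ideas above (image-level word prediction from a clustering model, and scoring predictions against held-out annotations), the sketch below shows one minimal way such a pipeline could look. It is not the paper's actual model or its performance measures: the Gaussian-style cluster responsibilities, the pooled feature representation, and all names (`predict_words`, `annotation_score`, `cluster_word_probs`, etc.) are hypothetical placeholders chosen for the example.

```python
import numpy as np

def predict_words(region_features, cluster_priors, cluster_feature_means,
                  cluster_word_probs, n_words=5):
    """Hypothetical image-level word prediction from a clustering model.

    p(word | image) is approximated by mixing per-cluster word distributions,
    weighted by how well each cluster explains the image's pooled region
    features. This is a simplified stand-in, not the paper's model.
    """
    # Crude cluster responsibilities from pooled region features,
    # assuming unit-variance Gaussian-like clusters in feature space.
    pooled = region_features.mean(axis=0)
    log_resp = np.log(cluster_priors) \
        - 0.5 * ((cluster_feature_means - pooled) ** 2).sum(axis=1)
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()

    # Image-level word distribution: mixture of per-cluster word distributions.
    word_probs = resp @ cluster_word_probs
    return np.argsort(word_probs)[::-1][:n_words]

def annotation_score(predicted, actual):
    """Simple overlap between predicted and held-out words.

    In the spirit of comparing predicted words with the actual annotations
    of a held-out image; the paper's performance measures are more refined.
    """
    predicted, actual = set(predicted), set(actual)
    if not actual:
        return 0.0
    return len(predicted & actual) / len(actual)
```

In this toy setup, `cluster_word_probs` is a (clusters × vocabulary) matrix of word emission probabilities learned jointly with the region-feature statistics; averaging scores such as `annotation_score` over a held-out set gives one (assumed, illustrative) way to compare model variants.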
