Analysis of gesture and action in technical talks for video indexing

We present an automatic system for analyzing and annotating video sequences of technical talks. Our method uses a robust motion estimation technique to detect key frames and to segment the video sequence into subsequences, each containing a single overhead slide. The subsequences are stabilized to remove the motion that occurs when the speaker adjusts their slides. Any changes that remain between frames of a stabilized sequence may be due to speaker gestures, such as pointing or writing, and we use active contours to track these potential gestures automatically. Given the constrained domain, we define a simple "vocabulary" of actions that can easily be recognized from the shape and motion of the active contour. The recognized actions provide a rich annotation of the sequence, which can be used to access a condensed version of the talk from a web page.
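The abstract does not specify the motion model or the estimator, so the Python sketch below only illustrates the general idea of robust segmentation, stabilization and change detection under simplifying assumptions. Frames are small grayscale images stored as lists of rows. Inter-frame motion is assumed to be a pure integer translation, whereas a real system would more likely fit a richer parametric model such as affine motion. The translation is found by exhaustive search that minimizes the Geman-McClure robust error, rather than by gradient-based estimation. After the estimated shift is compensated (stabilization), pixels whose residual exceeds a threshold are counted as outliers. If most pixels are outliers, the frame is treated as the start of a new slide. A small outlier fraction is flagged as a possible gesture. All function names and thresholds (`sigma`, `thresh`, `change_frac`, `gesture_frac`, `max_shift`) are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale frame, row-major: image[y][x]


def geman_mcclure(r: float, sigma: float) -> float:
    """Robust error: behaves like r^2 for small r, saturates near 1 for large r."""
    r2 = r * r
    return r2 / (r2 + sigma * sigma)


def robust_error(prev: Image, curr: Image, dx: int, dy: int, sigma: float) -> float:
    """Mean robust error of curr(x + dx, y + dy) against prev(x, y) over the overlap."""
    h, w = len(prev), len(prev[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            xs, ys = x + dx, y + dy
            if 0 <= xs < w and 0 <= ys < h:
                total += geman_mcclure(curr[ys][xs] - prev[y][x], sigma)
                count += 1
    return total / count if count else float("inf")


def estimate_translation(prev: Image, curr: Image,
                         max_shift: int = 2, sigma: float = 10.0) -> Tuple[int, int]:
    """Exhaustive search for the integer shift minimizing the robust error.

    Hypothetical stand-in for a robust parametric motion estimator.
    """
    best = (0, 0)
    best_err = robust_error(prev, curr, 0, 0, sigma)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err = robust_error(prev, curr, dx, dy, sigma)
            if err < best_err:
                best, best_err = (dx, dy), err
    return best


def outlier_fraction(prev: Image, curr: Image, shift: Tuple[int, int],
                     thresh: float = 30.0) -> float:
    """Fraction of overlapping pixels still unexplained after stabilization."""
    dx, dy = shift
    h, w = len(prev), len(prev[0])
    outliers = count = 0
    for y in range(h):
        for x in range(w):
            xs, ys = x + dx, y + dy
            if 0 <= xs < w and 0 <= ys < h:
                count += 1
                if abs(curr[ys][xs] - prev[y][x]) > thresh:
                    outliers += 1
    return outliers / count if count else 1.0


def segment(frames: List[Image], change_frac: float = 0.5,
            gesture_frac: float = 0.01) -> Tuple[List[int], List[int]]:
    """Return (indices where a new slide starts, indices with possible gestures)."""
    boundaries = [0]  # the first frame of each segment could serve as its key frame
    gestures = []
    for i in range(1, len(frames)):
        shift = estimate_translation(frames[i - 1], frames[i])
        frac = outlier_fraction(frames[i - 1], frames[i], shift)
        if frac > change_frac:
            boundaries.append(i)  # most pixels unexplained: likely a new slide
        elif frac > gesture_frac:
            gestures.append(i)  # small residual region: candidate gesture to track
    return boundaries, gestures
```

Exhaustive search keeps the sketch short and makes the robust objective explicit. Because it costs one pass over the image per candidate shift, a practical system would more likely use a coarse-to-fine or gradient-based robust estimator. In a full pipeline, the frames flagged as possible gestures are where a tracker such as an active contour would be initialized.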
