Automatic segmentation and alignment

Abstract

The segmentation of an unstructured medium into a set of coherent parts is an important task with applications in several domains. Our approach is based on the idea that segmenting a textual document into coherent excerpts can be the first step toward effective automatic document annotation and summarization. The automatic segmentation will be based on hidden Markov models (HMMs).

Description

The segmentation of an unstructured medium into a set of coherent parts is an important task with applications in several domains. For instance, it can be used to detect topic changes in automatic transcriptions of broadcast news, to index parts of a textual document for more effective retrieval, or to impose a structure on an unstructured document. Research in this area is quite active, especially under the umbrella of the Topic Detection and Tracking initiative, which was created to compare system performance on the detection of topic changes and on the recognition of topic subjects.

The approach recently started at IMS is based on the idea that segmenting a textual document into coherent excerpts can be the first step toward effective automatic document annotation. An annotation is considered here as the act of modifying the structure and/or the content of a textual document in order to give further information to the reader or to ease access to the document content. In particular, automatic segmentation can be coupled with automatic summarization of the topic content, and the summary can play the role of an automatic annotation. Hence, the approach allows for two levels of annotation: the automatic segmentation adds a structure to the document, so it can be considered an annotation that highlights the introduction of a new concept; the automatic summarization modifies the document content, so it can be considered an annotation that describes or introduces each new concept.

The approach for automatic segmentation will be based on hidden Markov models (HMMs), a tool that has been successfully applied at IMS to the segmentation and alignment of unstructured audio recordings (please refer to Music Information Retrieval for a detailed description of the approach). HMMs have also been applied to the task of automatic segmentation by other research groups, with encouraging results.
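
To make the idea of HMM-based segmentation concrete, the following is a minimal sketch in Python; it is not the IMS system, whose model details are not given here. It assumes that each hidden state is a topic, that each sentence is one observation scored by a unigram word model of the topic, and that a segment boundary is placed wherever the most likely state sequence (found with the Viterbi algorithm) changes topic. The word probabilities, the floor value for unknown words, the stay_prob parameter and all function names are hypothetical choices made for illustration.

```python
import math


def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state sequence; obs_loglik[t][s] is the
    log-likelihood of observation t under state s."""
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for t in range(1, len(obs_loglik)):
        new_score, pointers = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s]
                             + obs_loglik[t][s])
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def sentence_loglik(sentence, topic_model, floor=1e-3):
    """Unigram log-likelihood; unknown words get an assumed floor."""
    return sum(math.log(topic_model.get(w, floor))
               for w in sentence.lower().split())


def segment(sentences, topic_models, stay_prob=0.8):
    """Label each sentence with a topic and list boundary indices.
    Requires at least two topic models."""
    n = len(topic_models)
    switch = (1.0 - stay_prob) / (n - 1)
    log_trans = [[math.log(stay_prob if i == j else switch)
                  for j in range(n)] for i in range(n)]
    log_init = [math.log(1.0 / n)] * n
    obs = [[sentence_loglik(s, m) for m in topic_models]
           for s in sentences]
    path = viterbi(obs, log_trans, log_init)
    boundaries = [t for t in range(1, len(path))
                  if path[t] != path[t - 1]]
    return path, boundaries


# Toy topic models (hypothetical values, for illustration only).
topic_models = [
    {"match": 0.3, "goal": 0.3, "team": 0.3},
    {"market": 0.3, "stocks": 0.3, "bank": 0.3},
]
sentences = [
    "the team scored a goal",
    "the match ended late",
    "stocks fell as the market opened",
    "the bank cut rates",
]
path, boundaries = segment(sentences, topic_models)
print(path, boundaries)  # [0, 0, 1, 1] [2]
```

In this toy run the first two sentences are assigned to one topic and the last two to the other, so a single boundary is placed before the third sentence. A higher stay_prob makes topic changes more costly and therefore yields fewer, longer segments.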


Nicola Orio
Last modified: Mon Nov 04 14:32:12 CEST 2002