Automatic segmentation and alignment
Abstract
The segmentation of unstructured media into a set of coherent
parts is an important task with applications in different
domains. Our approach is based on the idea that segmenting
a textual document into coherent excerpts can be the first step
toward effective automatic document annotation and
summarization. The approach to automatic segmentation will be
based on hidden Markov models (HMMs).
Description
The segmentation of unstructured media into a set of coherent
parts is an important task with applications in different
domains. For instance, it can be used to detect topic changes
in automatic transcriptions of broadcast news, to index parts
of a textual document for more effective retrieval, or to
impose a structure on an unstructured document. Research in
this area is quite active, especially under the umbrella of the
Topic Detection and Tracking initiative, which was created to
compare system performance on the detection of topic changes
and on the recognition of topic subjects.
The approach recently started at IMS is based on the idea
that segmenting a textual document into coherent excerpts
can be the first step toward effective automatic document
annotation. Annotation is understood here as the act of modifying
the structure and/or the content of a textual document in order to
provide further information to the reader or to ease access to
the document content. In particular, automatic segmentation
can be coupled with automatic summarization of the content of
each topic, so that the summary plays the role of an automatic
annotation.
Hence, the approach allows for two levels of annotation: the
automatic segmentation adds a structure to the document, and so it
can be considered as an annotation which highlights the
introduction of a new concept; the automatic summarization
modifies the document content, and so it can be considered as an
annotation which describes or introduces each new concept.
The approach to automatic segmentation will be based on
hidden Markov models (HMMs), a tool that has been successfully
applied at IMS to the segmentation and alignment of unstructured
audio recordings (please refer to Music Information
Retrieval for a detailed description of the approach). HMMs
have also been applied to the task of automatic segmentation by
other research groups, with encouraging results.
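As an illustration only, the following minimal sketch shows how an HMM can segment a text once a model is available. It does not describe the model under development at IMS: the choice of one state per topic, the unigram word probabilities, and the names viterbi_segment, topic_word_probs, stay_prob and unk_prob are all assumptions made for the example. Each sentence is treated as one observation, and Viterbi decoding finds the most likely sequence of topic states; a segment boundary is placed wherever the decoded state changes.

```python
import math


def viterbi_segment(sentences, topic_word_probs, stay_prob=0.8, unk_prob=1e-3):
    """Label each sentence with its most likely topic and list the boundaries.

    Hypothetical setup: one HMM state per topic, a fixed probability of
    staying in the same topic, and unigram emissions with a floor value
    (unk_prob) for words the topic model has not seen.
    """
    topics = list(topic_word_probs)
    n = len(topics)
    if n < 2:
        raise ValueError("at least two topic states are required")
    if not sentences:
        return [], []

    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def log_trans(i, j):
        return log_stay if i == j else log_switch

    def log_emit(sentence, j):
        probs = topic_word_probs[topics[j]]
        return sum(math.log(probs.get(word, unk_prob)) for word in sentence)

    # Initialisation: uniform prior over the topic states.
    score = [math.log(1.0 / n) + log_emit(sentences[0], j) for j in range(n)]
    back = []

    # Recursion: keep the best predecessor for every state at every step.
    for sentence in sentences[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(
                score[best_i] + log_trans(best_i, j) + log_emit(sentence, j)
            )
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Backtracking from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()

    labels = [topics[i] for i in path]
    boundaries = [k for k in range(1, len(labels)) if labels[k] != labels[k - 1]]
    return labels, boundaries


# Toy data, invented for the example.
toy_model = {
    "sports": {"match": 0.3, "team": 0.3, "goal": 0.3},
    "finance": {"market": 0.3, "bank": 0.3, "stock": 0.3},
}
toy_text = [
    ["team", "goal"],
    ["match", "team"],
    ["market", "stock"],
    ["bank", "market"],
]
labels, boundaries = viterbi_segment(toy_text, toy_model)
```

On the toy data the decoded labels are sports, sports, finance, finance, with a single boundary before the third sentence. In a real system the transition and emission probabilities would be estimated from data rather than written by hand, and the decoded segments could then be passed to a summarization step.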
Nicola Orio
Last modified: Mon Nov 04 14:32:12 CEST 2002