Multilingual information retrieval
Abstract
Multilingual Information Retrieval has been used to refer to
various tasks ranging from monolingual IR in languages other than
English to IR on single documents containing text in more than one
language. We address stemming in a multilingual
context. Because several languages are present, stemming
is more complex than in a classical monolingual
context. We developed a language-independent stemming
methodology, called SPLIT (Stemming Program for Language
Independent Tasks), which allows us to build a stemming algorithm
for a specific language without a priori linguistic knowledge of
the language's morphology, inferring it instead directly from the
corpus of documents.
Description
This research concerns Multilingual Information Retrieval (MLIR), a
term that has been used to refer to various tasks, ranging from
monolingual IR in languages other than English to IR on single
documents containing text in more than one language.
In particular, we are studying the problems related to the
stemming process in a multilingual context. Stemming reduces
variant word forms to a common morphological root, in order to
reduce the differences between the vocabulary of the documents and
that of the queries. With several languages present, stemming is
more complex than in a classical monolingual context, because a
stemmer must be available for each language used in a document
collection or in an end user's query.
We developed a language-independent stemming methodology, called
SPLIT (Stemming Program for Language Independent Tasks), which
allows us to build a stemming algorithm for a specific language
without a priori linguistic knowledge of the language's morphology,
inferring it instead directly from the corpus of documents. The
basic idea of SPLIT is that good prefixes (stems) point to good
suffixes (derivations) and good suffixes are pointed to by good
prefixes. It uses a graph model to represent words, and the notion
of a mutually reinforcing relationship between stems and derivations
to estimate the degree to which a prefix of a word can be the stem
of that word.
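The mutual reinforcement idea can be illustrated with a minimal Python
sketch. This is not the SPLIT implementation: it assumes a HITS-style
iteration over a graph that links each candidate prefix to the suffix
completing a word, and it assumes that a split is scored by the product
of its prefix and suffix scores. The function names (build_graph,
reinforce, best_stem), the min_stem parameter, and the toy vocabulary
are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def build_graph(words, min_stem=3):
    """Link each candidate prefix to the suffix that completes a word.

    Assumption: every word is cut into exactly two non-empty parts, and
    the prefix must have at least `min_stem` characters.
    """
    edges = set()
    for word in set(words):
        for cut in range(min_stem, len(word)):
            edges.add((word[:cut], word[cut:]))
    return edges


def reinforce(edges, iterations=30):
    """HITS-style update: good prefixes point to good suffixes, and
    good suffixes are pointed to by good prefixes."""
    prefix_score = {p: 1.0 for p, _ in edges}
    for _ in range(iterations):
        # A suffix is as good as the prefixes that point to it.
        suffix_score = defaultdict(float)
        for p, s in edges:
            suffix_score[s] += prefix_score[p]
        # A prefix is as good as the suffixes it points to.
        new_prefix = defaultdict(float)
        for p, s in edges:
            new_prefix[p] += suffix_score[s]
        # Normalise both score vectors to unit length so values stay bounded.
        for scores in (new_prefix, suffix_score):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for key in scores:
                scores[key] /= norm
        prefix_score = dict(new_prefix)
    suffix_score = defaultdict(float)
    for p, s in edges:
        suffix_score[s] += prefix_score[p]
    return prefix_score, dict(suffix_score)


def best_stem(word, prefix_score, suffix_score, min_stem=3):
    """Return the prefix whose split has the highest combined score.

    Assumption: a split is scored as prefix score times suffix score.
    A word with no candidate split is returned unchanged.
    """
    best, best_score = word, 0.0
    for cut in range(min_stem, len(word)):
        prefix, suffix = word[:cut], word[cut:]
        score = prefix_score.get(prefix, 0.0) * suffix_score.get(suffix, 0.0)
        if score > best_score:
            best, best_score = prefix, score
    return best


# Toy vocabulary, invented for illustration only.
vocabulary = [
    "walk", "walks", "walked", "walking",
    "talk", "talks", "talked", "talking",
    "jump", "jumps", "jumped", "jumping",
]
prefix_score, suffix_score = reinforce(build_graph(vocabulary))
for word in ("walks", "talking", "jumped"):
    print(word, "->", best_stem(word, prefix_score, suffix_score))
```

On this toy vocabulary the prefixes "walk", "talk" and "jump" share the
suffixes "s", "ed" and "ing", so they reinforce one another and
outscore competing splits such as "wal" + "ks". A real corpus would
need a more careful treatment of very short prefixes and of words that
should not be split at all.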
We evaluated this stemming methodology on Italian and English, and
the results are encouraging: it performs as effectively as
stemming algorithms based on a priori linguistic knowledge
(Porter-like).
We are interested in testing this methodology on further languages
and in improving our "graph word model" in two ways: by generalizing
the number of possible splits, currently fixed at 2, to range from 0
(no split) to n (for example, for word compounding), and by
inserting directly into the model any linguistic knowledge available
to the developer, by weighting the links between two nodes.
Essential Bibliography
[1] M. Bacchin, N. Ferro, M. Melucci.
"The Effectiveness of a Graph-based Algorithm for Stemming",
Proceedings of International Conference on Asian Digital Libraries 2002,
Lecture Notes in Computer Science series, Springer Verlag, 2002, Singapore.
[2] M. Bacchin, N. Ferro, M. Melucci.
"University of Padua at CLEF 2002: Experiments to Evaluate a Statistical
Stemming Algorithm",
CLEF 2002 Workshop Working Note, Sep, 2002, Rome, Italy.
Michela Bacchin
Last modified: Thu Oct 17 11:19:36 CEST 2002