Automatic Text Categorization (ATC), the assignment of natural language texts to one or more predefined categories, is an important task in many management information applications. For example, it may be used as an indexing mechanism for text retrieval, as a component of an information filtering system, or as an end in itself when the categorization of documents is of interest. In recent years, the main approach to the problem has been based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. This approach has led to effective algorithms, considerable savings in time, and straightforward portability to different domains.
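To make the inductive process described above concrete, the following is a minimal sketch of a classifier learned from preclassified documents. It assumes a multinomial Naive Bayes learner with add-one smoothing, chosen here only because it is compact; the text does not prescribe a particular learning algorithm. The function names (`tokenize`, `train`, `classify`) and the four training documents are hypothetical and serve purely as illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical preclassified documents: (text, category) pairs.
TRAINING_DOCS = [
    ("stock markets fell sharply today", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the football match", "sport"),
    ("the striker scored two goals", "sport"),
]


def tokenize(text):
    """Split a text into lowercase whitespace-separated tokens."""
    return text.lower().split()


def train(labeled_docs):
    """Learn per-category statistics from preclassified documents."""
    doc_counts = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()                       # all words seen during training
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Words never seen in training carry no evidence and are skipped.
    words = [w for w in tokenize(text) if w in vocab]
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in words:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train(TRAINING_DOCS)
print(classify(model, "interest rates and stock markets"))
print(classify(model, "football goals"))
```

Porting this sketch to a different domain only requires replacing `TRAINING_DOCS` with documents preclassified under the new categories, which is the portability advantage noted above.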
Automatic Text Categorization (ATC), the activity of labelling natural language texts with thematic categories, has gained a prominent status in the information filtering field over the last ten years, owing to the increased availability of documents in digital form and the need to access them in different ways. ATC is now applied in many contexts, ranging from document indexing based on a controlled vocabulary to document filtering, automated metadata generation, word sense disambiguation, and the population of digital libraries. Until the late 1980s, the most popular approach to ATC was a knowledge engineering one. In the 1990s, the machine learning paradigm, in which a general inductive process automatically builds a text classifier by learning, from a set of preclassified documents, the characteristics of the categories, grew in popularity because of its advantages: high accuracy and, most importantly, considerable savings in expert labor. Today, ATC is a discipline at the crossroads of machine learning and information retrieval and, although it still lacks standardization, involves a number of researchers. More recently, a new task has arisen from the need to index the large number of pages available on the World Wide Web; this task is known as Web Page Categorization (WPC) and differs slightly from ATC. While unrestricted English texts have an intuitive and simple structure, HTML documents are extremely diverse (home pages, resource lists, active pages, etc.), are usually quite short and concise, and therefore pose a new challenge for automatic classification.