Categorization of textual documents into hierarchical taxonomies

                       Domonkos Tikk

                        TU Budapest
                          
                          Abstract

Traditionally, document categorization has been performed
manually. However, as the number of documents has grown explosively,
manual categorization has become impractical, requiring a vast amount
of time and cost. This has led to extensive research on automatic
document classification. A text classifier assigns a document to one
or more appropriate categories, also called topics, from a predefined
set of categories.

Originally, research in text categorization addressed the binary
problem, where a document is either relevant or not with respect to a
given category. In real-world situations, however, the great variety
of sources, and hence of categories, usually poses a multi-class
classification problem, where a document belongs to exactly one
category selected from a predefined set. As the number of topics
grows, multi-class categorizers face a complexity problem: time and
storage requirements may increase rapidly, and the categorized subject
domain becomes harder to survey. A common way to manage complexity is
to use a hierarchy, and text is no exception. Internet directories and
large on-line databases are often organized as hierarchies; see e.g.
Yahoo (http://www.yahoo.com).
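
As a rough illustration of how a hierarchy limits the number of
categories compared at each step, the following minimal sketch
descends a toy taxonomy top-down. Everything in it is an assumption
made for illustration: the Category class, the keyword-overlap score,
and the sample taxonomy are hypothetical and are not taken from the
talk.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of a hypothetical topic taxonomy (illustrative only)."""
    name: str
    keywords: set = field(default_factory=set)
    children: list = field(default_factory=list)


def score(category, tokens):
    # Toy relevance score: how many document tokens are category keywords.
    return sum(1 for t in tokens if t in category.keywords)


def classify(root, text):
    """Descend from the root, following the best-scoring child per level."""
    tokens = text.lower().split()
    path = []
    node = root
    while node.children:
        best = max(node.children, key=lambda c: score(c, tokens))
        if score(best, tokens) == 0:
            break  # no child matches; stop at the current level
        path.append(best.name)
        node = best
    return path


# Hypothetical two-level taxonomy.
taxonomy = Category("root", children=[
    Category("science", {"physics", "quantum", "cell", "biology"}, [
        Category("physics", {"quantum", "particle"}),
        Category("biology", {"cell", "gene"}),
    ]),
    Category("sports", {"match", "goal", "league"}, [
        Category("football", {"goal", "league"}),
        Category("tennis", {"serve", "set"}),
    ]),
])

print(classify(taxonomy, "quantum particle physics"))
```

At each level only the children of the current node are scored, so
the work per document depends on the depth and branching of the
taxonomy, not on the total number of topics.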

The talk presents a hierarchical text categorization approach with
some experimental results. The main part of the approach is an
iterative learning module that gradually trains the classifier to
recognize the constitutive characteristics of categories and hence to
discriminate between typical documents belonging to different
categories. The iterative learning helps to avoid overfitting the
training data and to refine the characteristics of categories. Testing
was performed on the well-known Reuters-21578 document collection and
on the WIPO-alpha (World Intellectual Property Organization) patent
database.
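
The abstract does not specify the learning rule, so the following is
only a generic sketch of what an iterative, mistake-driven training
loop can look like. It is not the method of the talk. The functions
predict and train, the unit weight updates, the max_iters cap, and the
sample documents are all assumptions introduced for illustration.

```python
from collections import defaultdict


def predict(weights, tokens):
    # Choose the category whose term weights sum highest over the tokens.
    return max(weights,
               key=lambda c: sum(weights[c].get(t, 0.0) for t in tokens))


def train(docs, categories, max_iters=10):
    """docs: list of (tokens, label) pairs. Returns per-category weights."""
    weights = {c: defaultdict(float) for c in categories}
    for _ in range(max_iters):  # capped number of passes (assumption)
        errors = 0
        for tokens, label in docs:
            guess = predict(weights, tokens)
            if guess != label:
                errors += 1
                for t in tokens:
                    weights[label][t] += 1.0  # reinforce correct category
                    weights[guess][t] -= 1.0  # penalise the wrong guess
        if errors == 0:
            break  # every training document is classified correctly
    return weights


# Hypothetical toy training set.
docs = [
    (["grain", "wheat"], "grain"),
    (["oil", "crude"], "crude"),
    (["wheat", "export"], "grain"),
    (["crude", "barrel"], "crude"),
]
weights = train(docs, ["grain", "crude"])
print(predict(weights, ["crude", "barrel"]))
```

Weights change only on misclassified documents, and the number of
passes is bounded, which is one simple way to limit how closely the
category descriptions are fitted to the training set.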