Categorization of textual documents into hierarchical taxonomies

Domonkos Tikk
TU Budapest

Abstract

Traditionally, document categorization has been performed manually. However, as the number of documents grew explosively, manual categorization became impractical because of the time and cost it requires. This has led to extensive research on automatic document classification. A text classifier assigns a document to the appropriate category or categories, also called topics, in a predefined set of categories.

Originally, research in text categorization addressed the binary problem, where a document is either relevant or not relevant with respect to a given category. In real-world settings, however, the great variety of sources, and hence of categories, usually poses a multi-class classification problem, where a document belongs to exactly one category selected from a predefined set. As the number of topics grows, multi-class categorizers face a complexity problem: time and storage requirements may increase rapidly, and the categorized subject domain becomes harder to survey. A common way to manage complexity is to use a hierarchy, and text is no exception. Internet directories and large on-line databases are often organized as hierarchies; see e.g. Yahoo (http://www.yahoo.com).

The talk presents a hierarchical text categorization approach together with some experimental results. The main part of the approach is an iterative learning module that gradually trains the classifier to recognize the constitutive characteristics of categories and hence to discriminate between typical documents belonging to different categories. The iterative learning helps to avoid overfitting to the training data and to refine the characteristics of categories. Testing was performed on the well-known Reuters-21578 document collection and on the WIPO-alpha (World Intellectual Property Organization) patent database.
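
To make the idea of hierarchical categorization with iterative training concrete, here is a minimal Python sketch. It is not the method presented in the talk: the toy taxonomy, the example documents, the perceptron-style weight update, and all names (TAXONOMY, EXAMPLES, train_profiles, classify) are assumptions chosen for illustration. Each category holds a profile of term weights; a document is routed top-down through the taxonomy, and training repeatedly corrects the profiles wherever the routing goes wrong.

```python
from collections import Counter

# Hypothetical toy taxonomy (an assumption): parent -> child categories.
TAXONOMY = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["football", "tennis"],
}

# Hypothetical training documents, each labeled with one leaf category.
EXAMPLES = [
    ("quantum particle energy", "physics"),
    ("cell gene protein", "biology"),
    ("goal striker league", "football"),
    ("serve racket court", "tennis"),
]


def tokenize(text):
    """Lowercase whitespace tokenization; real systems would do far more."""
    return text.lower().split()


def score(profile, terms):
    """Dot product between a category's term weights and a document's term counts."""
    return sum(profile[t] * n for t, n in terms.items())


def train_profiles(examples, taxonomy, root="root", epochs=10, step=1.0):
    """Iteratively refine per-category term weights.

    Assumed update rule (perceptron-style): whenever a document is routed to
    the wrong sibling at some level, reinforce the correct category's weights
    and weaken the wrongly chosen one. Stops early after an error-free pass.
    """
    parent = {child: p for p, children in taxonomy.items() for child in children}
    profiles = {category: Counter() for category in parent}
    for _ in range(epochs):
        errors = 0
        for text, leaf in examples:
            terms = Counter(tokenize(text))
            # Collect the categories on the path from the leaf up to the root.
            path, node = [], leaf
            while node != root:
                path.append(node)
                node = parent[node]
            # Check the routing decision at each level, top-down.
            for target in reversed(path):
                siblings = taxonomy[parent[target]]
                predicted = max(siblings, key=lambda c: score(profiles[c], terms))
                if predicted != target:
                    errors += 1
                    for term, count in terms.items():
                        profiles[target][term] += step * count
                        profiles[predicted][term] -= step * count
        if errors == 0:
            break
    return profiles


def classify(text, profiles, taxonomy, root="root"):
    """Descend from the root, picking the best-scoring child at each level.

    Ties go to the child listed first in the taxonomy.
    """
    terms = Counter(tokenize(text))
    node, path = root, []
    while node in taxonomy:
        node = max(taxonomy[node], key=lambda c: score(profiles[c], terms))
        path.append(node)
    return path


if __name__ == "__main__":
    profiles = train_profiles(EXAMPLES, TAXONOMY)
    print(classify("racket and serve", profiles, TAXONOMY))
```

The top-down descent only compares sibling categories at each level, so the number of comparisons per document depends on the depth and branching of the taxonomy rather than on the total number of topics. This is one way a hierarchy can keep the cost of multi-class categorization manageable.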