Classification of Images on Internet by Visual and Textual Information

Theo Gevers and Frank Aldershoff
Faculty of Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

Keywords: WWW, HTML, image search engines, Internet image classification, supervised learning, combining textual/image features, image databases

In this paper, we study computational models and techniques to combine textual and image features for the classification of images on the Internet. A framework is given to index images on the basis of textual, pictorial and composite (textual-pictorial) information. The scheme makes use of weighted document terms and color invariant image features to obtain a high-dimensional similarity descriptor to be used as an index. Based on supervised learning, the k-nearest neighbor classifier is used to organize Internet images into semantically meaningful groups. Images are first classified into photographic and synthetic (artwork) images. Photographic images are then further classified into portraits (i.e. the image contains a substantial face) and non-portraits, and synthetic images are further classified into button and non-button images.
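The indexing and classification scheme described above can be illustrated with a minimal sketch. It assumes that the composite descriptor is a plain concatenation of weighted term features and image features, and that neighbors are ranked by Euclidean distance; neither choice is stated in this section, so both are assumptions. The vocabulary, term weights, three-bin color histograms, and the function names `composite_descriptor` and `knn_classify` are hypothetical and serve illustration only.

```python
import math
from collections import Counter

# Hypothetical vocabulary of document terms (assumption, not from the paper).
VOCAB = ["photo", "button", "logo"]


def composite_descriptor(term_weights, color_features, vocabulary=VOCAB):
    """Concatenate weighted term features and image features into one vector.

    term_weights: dict mapping a term to its weight for this image.
    color_features: list of floats, e.g. a small color histogram.
    """
    text_part = [term_weights.get(term, 0.0) for term in vocabulary]
    return text_part + list(color_features)


def knn_classify(query, training, k=3):
    """Return the majority label among the k training vectors nearest to query.

    training: list of (vector, label) pairs.
    Euclidean distance is an assumption made for this sketch.
    """
    neighbors = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy labeled set for the first level of the hierarchy (made-up values).
training = [
    (composite_descriptor({"photo": 0.9}, [0.3, 0.3, 0.4]), "photographic"),
    (composite_descriptor({"photo": 0.7}, [0.4, 0.3, 0.3]), "photographic"),
    (composite_descriptor({"button": 0.8}, [1.0, 0.0, 0.0]), "synthetic"),
    (composite_descriptor({"logo": 0.9}, [0.0, 1.0, 0.0]), "synthetic"),
]

query = composite_descriptor({"photo": 0.8}, [0.35, 0.3, 0.35])
label = knn_classify(query, training, k=3)
print(label)
```

Under the same assumptions, the second level of the hierarchy could reuse `knn_classify` with a different labeled set, for example portrait/non-portrait examples for images already labeled photographic.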

Experiments have been conducted on a large set of images downloaded from the Internet to evaluate the accuracy of combining textual and pictorial information for classification. From the experimental results it is concluded that, for classifying images into photographic/synthetic classes, the contributions of image and textual features are equally important. Consequently, highly discriminative classification power is obtained based on composite information. Classifying images into portraits/non-portraits shows that pictorial information is more important than textual information. This is due to the inconsistent textual descriptions, such as surnames, assigned to the portrait images that we found on the Internet. Hence, only a marginal improvement in performance is achieved by using composite information for classifying into portrait and non-portrait classes. As an extension, a list of surnames was added to the training set, which enhanced the classification rate significantly.