Composed Image Retrieval for Training-Free Domain Conversion

WACV 2025 Oral

Nikos Efthymiadis1*, Bill Psomas1,2, Zakaria Laskar1, Konstantinos Karantzalos2, Yannis Avrithis3, Ondřej Chum1, Giorgos Tolias1

1 VRG, FEE, Czech Technical University in Prague
2 National Technical University of Athens
3 Institute of Advanced Research in Artificial Intelligence (IARAI), Austria

*Corresponding author: efthynik@fel.cvut.cz


Overview

[Figure: teaser]

We introduce FREEDOM, a training-free composed image retrieval (CIR) method for domain conversion based on vision-language models (VLMs). Given an image query and a text query that names a domain, the method retrieves images that share the class of the image query and belong to the domain of the text query. We target a range of applications, where classes can be defined at category level (a, b) or instance level (c), and domains can be defined as styles (a, c) or context (b). In the above visualization, retrieved images are shown for each image query under different text queries.

Motivation

In this paper, we focus on a specific variant of composed image retrieval, namely domain conversion, where the text query defines the target domain. Unlike conventional cross-domain retrieval, where models are trained to take queries from a source domain and retrieve items from a different target domain, we address a more practical, open-domain setting, where the query and database may come from any unseen domain. We target different variants of this task, where the class of the query object is defined at category level (a, b) or instance level (c), while the domain corresponds to descriptions of style (a, c) or context (b). Even though domain conversion is a subset of the tasks handled by existing CIR methods, the variants considered in our work cover a more comprehensive set of applications than those considered in prior work.

Approach

[Figure: overview of the approach]

Given a query image and a query text indicating the target domain, proxy images are first retrieved for the query image through an image-to-image search over a visual memory. Then, a set of text labels is associated with each proxy image through an image-to-text search over a textual memory. Each of the most frequent text labels is combined with the query text in the text space, and images are retrieved from the database by text-to-image search. The resulting sets of similarities are linearly combined, with the labels' frequencies of occurrence as weights. Below: k=4 proxy images, n=3 text labels per proxy image, m=2 most frequent text labels.
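The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: all embeddings are assumed to be L2-normalized vectors from a shared VLM space, the function name freedom_sketch and its arguments are hypothetical, and "combining in the text space" is stood in for by adding the label and query-text embeddings and renormalizing, which the description above does not specify. The weights are the label frequencies divided by their sum, which leaves the ranking unchanged.

```python
import numpy as np
from collections import Counter


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def freedom_sketch(query_img, query_txt, visual_memory, textual_memory,
                   database, k=4, n=3, m=2):
    """Hypothetical sketch of the pipeline described above.

    query_img      (d,)   embedding of the query image
    query_txt      (d,)   embedding of the text naming the target domain
    visual_memory  (V, d) image embeddings used to find proxy images
    textual_memory (T, d) embeddings of candidate text labels
    database       (N, d) embeddings of the images to be ranked
    """
    # 1. Image-to-image search: the k nearest proxy images in the visual memory.
    proxy_idx = np.argsort(-(visual_memory @ query_img))[:k]

    # 2. Image-to-text search: the n nearest text labels for each proxy image.
    label_counts = Counter()
    for p in proxy_idx:
        top_labels = np.argsort(-(textual_memory @ visual_memory[p]))[:n]
        label_counts.update(top_labels.tolist())

    # 3. Keep the m most frequent labels together with their counts.
    frequent = label_counts.most_common(m)
    total = sum(count for _, count in frequent)

    # 4. Compose each label with the query text (assumption: add and
    #    renormalize), run text-to-image search, and combine the similarity
    #    sets linearly with the label frequencies as weights.
    scores = np.zeros(len(database))
    for label, count in frequent:
        composed = l2_normalize(textual_memory[label] + query_txt)
        scores += (count / total) * (database @ composed)

    return np.argsort(-scores), scores


# Toy example: axes 0-1 encode classes (cat, dog), axes 2-3 encode domains
# (photo, sketch). The query is a cat photo; the text query is "sketch".
e = np.eye(4)
cat_photo, dog_photo = l2_normalize(e[0] + e[2]), l2_normalize(e[1] + e[2])
cat_sketch, dog_sketch = l2_normalize(e[0] + e[3]), l2_normalize(e[1] + e[3])

visual_memory = np.stack([cat_photo] * 4 + [dog_photo] * 4)
textual_memory = e[:2]  # label 0 = "cat", label 1 = "dog"
database = np.stack([cat_sketch, dog_sketch, cat_photo, dog_photo])

ranking, scores = freedom_sketch(cat_photo, e[3], visual_memory,
                                 textual_memory, database, k=4, n=1, m=1)
```

In the toy example the top-ranked database item is the cat sketch (index 0), that is, the item with the class of the image query and the domain of the text query.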

Results with CLIP ViT-L/14

ImageNet-R
Method             CAR     ORI     PHO     SCU     TOY     AVG
Text              0.82    0.63    0.68    0.78    0.78    0.74
Image             4.27    3.12    0.84    5.86    5.08    3.84
Text + Image      6.61    4.45    2.17    9.18    8.62    6.21
Text × Image      8.21    5.62    6.98    8.95    9.41    7.83
FreeDom          35.97   11.80   27.97   36.58   37.21   29.91

MiniDomainNet
Method            CLIP   PAINT     PHO     SKE     AVG
Text              0.63    0.52    0.63    0.51    0.57
Image             7.15    7.31    4.38    7.78    6.66
Text + Image      9.59    9.97    9.22    8.53    9.33
Text × Image      9.01    8.66   15.87    5.90    9.86
FreeDom          41.96   31.65   41.12   34.36   37.27

NICO++
Method             AUT     DIM     GRA     OUT     ROC     WAT     AVG
Text              1.00    0.99    1.15    1.23    1.10    1.05    1.09
Image             6.45    4.85    5.67    7.67    7.53    8.75    6.82
Text + Image      8.46    6.58    9.22   11.91   11.20    8.41    9.30
Text × Image      8.24    6.36   12.11   12.71   10.46    8.84    9.79
FreeDom          24.35   24.41   30.06   30.51   26.92   20.37   26.10

LTLL
Method           TODAY ARCHIVE     AVG
Text              5.28    6.16    5.72
Image             8.47   24.51   16.49
Text + Image      9.60   26.13   17.86
Text × Image     16.42   29.90   23.16
FreeDom          30.95   35.52   33.24
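
As a quick sanity check on how to read the tables above, the AVG column appears to be the arithmetic mean of the per-domain scores, up to rounding of the reported values. The snippet below checks this for the FreeDom rows and computes the margin over the strongest baseline listed (Text × Image) on each dataset. The numbers are copied from the tables; the variable names and the 0.01 tolerance are assumptions made for illustration.

```python
# (per-domain scores, reported AVG) for the FreeDom row of each dataset.
freedom_rows = {
    "ImageNet-R": ([35.97, 11.80, 27.97, 36.58, 37.21], 29.91),
    "MiniDomainNet": ([41.96, 31.65, 41.12, 34.36], 37.27),
    "NICO++": ([24.35, 24.41, 30.06, 30.51, 26.92, 20.37], 26.10),
    "LTLL": ([30.95, 35.52], 33.24),
}

# Reported AVG of the Text × Image baseline on each dataset.
text_x_image_avg = {
    "ImageNet-R": 7.83,
    "MiniDomainNet": 9.86,
    "NICO++": 9.79,
    "LTLL": 23.16,
}


def avg_matches(per_domain, reported, tol=0.01):
    """True if the mean of the per-domain scores matches the reported AVG
    within a tolerance that allows for rounding of the table entries."""
    return abs(sum(per_domain) / len(per_domain) - reported) <= tol


avg_ok = {name: avg_matches(scores, avg)
          for name, (scores, avg) in freedom_rows.items()}

# Margin of FreeDom over Text × Image, in the same units as the tables.
margins = {name: round(avg - text_x_image_avg[name], 2)
           for name, (_, avg) in freedom_rows.items()}

for name in freedom_rows:
    print(f"{name}: AVG consistent = {avg_ok[name]}, margin = {margins[name]}")
```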