Composed Image Retrieval for Training-Free Domain Conversion

WACV 2025 Oral

Nikos Efthymiadis¹*, Bill Psomas¹,², Zakaria Laskar¹, Konstantinos Karantzalos², Yannis Avrithis³, Ondřej Chum¹, Giorgos Tolias¹

¹VRG, FEE, Czech Technical University in Prague ²National Technical University of Athens

³Institute of Advanced Research in Artificial Intelligence (IARAI), Austria

*Corresponding author: efthynik@fel.cvut.cz


Overview

We introduce FreeDom, a training-free composed image retrieval (CIR) method for domain conversion based on vision-language models (VLMs). Given an image query and a text query that names a domain, it retrieves images that have the class of the image query and the domain of the text query. We target a range of applications, where classes can be defined at category level (a, b) or instance level (c), and domains can be defined as styles (a, c) or context (b). In the above visualization, for each image query, retrieved images are shown for different text queries.

Motivation

In this paper, we focus on a specific variant of composed image retrieval, namely domain conversion, where the text query defines the target domain. Unlike conventional cross-domain retrieval, where models are trained to take queries from a source domain and retrieve items from a target domain, we address a more practical, open-domain setting, where the query and database may come from any unseen domain. We target different variants of this task, where the class of the query object is defined at category level (a, b) or instance level (c), and the domain corresponds to a description of style (a, c) or context (b). Even though domain conversion is a subset of the tasks handled by existing CIR methods, the variants considered in our work cover a more comprehensive set of applications than those considered in prior work.

Approach

Given a query image and a query text indicating the target domain, proxy images are first retrieved for the query image through an image-to-image search over a visual memory. Then, a set of text labels is associated with each proxy image through an image-to-text search over a textual memory. Each of the most frequent text labels is combined with the query text in the text space, and images are retrieved from the database by text-to-image search. The resulting sets of similarities are linearly combined, with the frequencies of occurrence as weights. The accompanying figure uses k=4 proxy images, n=3 text labels per proxy image, and m=2 most frequent text labels.
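The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: embeddings are toy 4-D unit vectors standing in for VLM features, similarity is a dot product, a label is combined with the query text by adding the two text embeddings and re-normalizing, and the frequency weights are normalized to sum to one. The function name `freedom_sketch` and all toy data are hypothetical.

```python
import math
from collections import Counter


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_indices(query, memory, k):
    """Indices of the k memory vectors most similar to the query."""
    scores = [dot(query, item) for item in memory]
    return sorted(range(len(memory)), key=lambda i: -scores[i])[:k]


def freedom_sketch(query_img, query_txt, visual_memory, text_memory,
                   database, k=4, n=3, m=2):
    # 1) Image-to-image search: k proxy images from the visual memory.
    proxies = [visual_memory[i] for i in top_indices(query_img, visual_memory, k)]

    # 2) Image-to-text search: n text labels per proxy image.
    counts = Counter()
    for proxy in proxies:
        counts.update(top_indices(proxy, text_memory, n))

    # 3) Keep the m most frequent labels with their frequencies.
    frequent = counts.most_common(m)
    total = sum(c for _, c in frequent)

    # 4) Combine each label with the query text (assumption: sum of
    #    embeddings, re-normalized), run text-to-image search, and
    #    accumulate similarities weighted by normalized frequency.
    scores = [0.0] * len(database)
    for label_idx, freq in frequent:
        composed = normalize(
            [a + b for a, b in zip(text_memory[label_idx], query_txt)])
        for d, img in enumerate(database):
            scores[d] += (freq / total) * dot(composed, img)
    return scores


# Toy embedding axes (hypothetical): [dog, cat, photo, sketch].
text_memory = [[1, 0, 0, 0], [0, 1, 0, 0]]            # labels "dog", "cat"
visual_memory = [normalize(v) for v in
                 ([1, 0, 1, 0], [0.9, 0.1, 1, 0], [0, 1, 1, 0])]
database = [normalize(v) for v in
            ([1, 0, 0, 1],    # dog sketch
             [0, 1, 0, 1],    # cat sketch
             [1, 0, 1, 0])]   # dog photo
query_img = normalize([1, 0, 0.8, 0])                 # a dog photo
query_txt = [0, 0, 0, 1]                              # target domain "sketch"

scores = freedom_sketch(query_img, query_txt, visual_memory, text_memory,
                        database, k=2, n=1, m=1)
ranking = sorted(range(len(database)), key=lambda d: -scores[d])
print(ranking[0], [round(s, 3) for s in scores])
```

In this toy run, both proxy images vote for the label "dog", so the composed text query points towards "dog" plus "sketch" and the dog sketch in the database receives the highest score. The defaults k=4, n=3, m=2 match the values quoted in the text; the toy call uses smaller values only because the memories are tiny.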

Results with CLIP ViT-L/14

ImageNet-R
Method	CAR	ORI	PHO	SCU	TOY	AVG
Text	0.82	0.63	0.68	0.78	0.78	0.74
Image	4.27	3.12	0.84	5.86	5.08	3.84
Text + Image	6.61	4.45	2.17	9.18	8.62	6.21
Text × Image	8.21	5.62	6.98	8.95	9.41	7.83
FreeDom	35.97	11.80	27.97	36.58	37.21	29.91

MiniDomainNet
Method	CLIP	PAINT	PHO	SKE	AVG
Text	0.63	0.52	0.63	0.51	0.57
Image	7.15	7.31	4.38	7.78	6.66
Text + Image	9.59	9.97	9.22	8.53	9.33
Text × Image	9.01	8.66	15.87	5.90	9.86
FreeDom	41.96	31.65	41.12	34.36	37.27

NICO++
Method	AUT	DIM	GRA	OUT	ROC	WAT	AVG
Text	1.00	0.99	1.15	1.23	1.10	1.05	1.09
Image	6.45	4.85	5.67	7.67	7.53	8.75	6.82
Text + Image	8.46	6.58	9.22	11.91	11.20	8.41	9.30
Text × Image	8.24	6.36	12.11	12.71	10.46	8.84	9.79
FreeDom	24.35	24.41	30.06	30.51	26.92	20.37	26.10

LTLL
Method	TODAY	ARCHIVE	AVG
Text	5.28	6.16	5.72
Image	8.47	24.51	16.49
Text + Image	9.60	26.13	17.86
Text × Image	16.42	29.90	23.16
FreeDom	30.95	35.52	33.24