Nikos Efthymiadis1*, Bill Psomas1,2, Zakaria Laskar1, Konstantinos Karantzalos2, Yannis Avrithis3, Ondřej Chum1, Giorgos Tolias1
1VRG, FEE, Czech Technical University in Prague 2National Technical University of Athens
3Institute of Advanced Research in Artificial Intelligence (IARAI), Austria
*Corresponding author: efthynik@fel.cvut.cz
We introduce FREEDOM, a training-free, composed image retrieval (CIR) method for domain conversion based on vision-language models (VLMs). Given an image query and a text query that names a domain, images are retrieved that share the class of the image query and the domain of the text query. A range of applications is targeted, where classes can be defined at category level (a, b) or instance level (c), and domains can be defined as styles (a, c) or context (b). In the above visualization, for each image query, retrieved images are shown for different text queries.
In this paper, we focus on a specific variant of composed image retrieval, namely domain conversion, where the text query defines the target domain. Unlike conventional cross-domain retrieval, where models are trained to retrieve items of a fixed target domain given queries from a fixed source domain, we address a more practical, open-domain setting, where the query and database may come from any unseen domain. We target different variants of this task, where the class of the query object is defined at category level (a, b) or instance level (c). At the same time, the domain corresponds to descriptions of style (a, c) or context (b). Even though domain conversion is a subset of the tasks handled by existing CIR methods, the variants considered in our work reflect a more comprehensive set of applications than those encountered in prior art.
Given a query image and a query text indicating the target domain, proxy images are first retrieved for the query image through image-to-image search over a visual memory. Then, a set of text labels is associated with each proxy image through image-to-text search over a textual memory. Each of the most frequent text labels is combined with the query text in the text space, and images are retrieved from the database by text-to-image search. The resulting sets of similarities are linearly combined, with the frequencies of occurrence as weights. Below: k=4 proxy images, n=3 text labels per proxy image, m=2 most frequent text labels.
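The steps above can be sketched in code. This is a minimal illustration with random embeddings standing in for a VLM (e.g., a CLIP-style encoder); all array names and sizes are assumptions, and the embedding average used to combine a label with the domain text is a stand-in for composing them in the text encoder, not the paper's exact composition.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def l2n(x):
    """L2-normalize along the last axis (cosine similarity via dot product)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embedding spaces; in practice these come from a VLM encoder.
D = 64
visual_memory = l2n(rng.normal(size=(100, D)))  # memory image embeddings
label_texts   = l2n(rng.normal(size=(20, D)))   # textual memory: one embedding per label
database      = l2n(rng.normal(size=(500, D)))  # database image embeddings

def freedom_scores(query_img, domain_text, k=4, n=3, m=2):
    # 1) image-to-image search: k proxy images from the visual memory
    proxies = np.argsort(-(visual_memory @ query_img))[:k]
    # 2) image-to-text search: n text labels per proxy, voted into a counter
    votes = Counter()
    for p in proxies:
        top_labels = np.argsort(-(label_texts @ visual_memory[p]))[:n]
        votes.update(top_labels.tolist())
    # 3) keep the m most frequent labels, with frequencies as weights
    top_m = votes.most_common(m)
    # 4) combine each label with the domain text (embedding average as a
    #    stand-in), then text-to-image search over the database;
    # 5) linearly combine the similarity sets, weighted by frequency
    scores = np.zeros(len(database))
    for label, freq in top_m:
        composed = l2n(label_texts[label] + domain_text)
        scores += freq * (database @ composed)
    return scores

query_img  = l2n(rng.normal(size=D))
domain_txt = l2n(rng.normal(size=D))
s = freedom_scores(query_img, domain_txt)
ranking = np.argsort(-s)  # final retrieval order
```

Note that the whole pipeline is training-free: it only reuses similarities in the pretrained embedding spaces, which is what makes the method applicable to unseen domains.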
Results on ImageNet-R. Domains: CAR (cartoon), ORI (origami), PHO (photo), SCU (sculpture), TOY (toy).

| Method | CAR | ORI | PHO | SCU | TOY | AVG |
|---|---|---|---|---|---|---|
| Text | 0.82 | 0.63 | 0.68 | 0.78 | 0.78 | 0.74 |
| Image | 4.27 | 3.12 | 0.84 | 5.86 | 5.08 | 3.84 |
| Text + Image | 6.61 | 4.45 | 2.17 | 9.18 | 8.62 | 6.21 |
| Text × Image | 8.21 | 5.62 | 6.98 | 8.95 | 9.41 | 7.83 |
| FreeDom | 35.97 | 11.80 | 27.97 | 36.58 | 37.21 | 29.91 |
Results on MiniDomainNet. Domains: CLIP (clipart), PAINT (painting), PHO (photo), SKE (sketch).

| Method | CLIP | PAINT | PHO | SKE | AVG |
|---|---|---|---|---|---|
| Text | 0.63 | 0.52 | 0.63 | 0.51 | 0.57 |
| Image | 7.15 | 7.31 | 4.38 | 7.78 | 6.66 |
| Text + Image | 9.59 | 9.97 | 9.22 | 8.53 | 9.33 |
| Text × Image | 9.01 | 8.66 | 15.87 | 5.90 | 9.86 |
| FreeDom | 41.96 | 31.65 | 41.12 | 34.36 | 37.27 |
Results on NICO++. Domains: AUT (autumn), DIM (dim lighting), GRA (grass), OUT (outdoor), ROC (rock), WAT (water).

| Method | AUT | DIM | GRA | OUT | ROC | WAT | AVG |
|---|---|---|---|---|---|---|---|
| Text | 1.00 | 0.99 | 1.15 | 1.23 | 1.10 | 1.05 | 1.09 |
| Image | 6.45 | 4.85 | 5.67 | 7.67 | 7.53 | 8.75 | 6.82 |
| Text + Image | 8.46 | 6.58 | 9.22 | 11.91 | 11.20 | 8.41 | 9.30 |
| Text × Image | 8.24 | 6.36 | 12.11 | 12.71 | 10.46 | 8.84 | 9.79 |
| FreeDom | 24.35 | 24.41 | 30.06 | 30.51 | 26.92 | 20.37 | 26.10 |
Results on LTLL. Domains: TODAY and ARCHIVE.

| Method | TODAY | ARCHIVE | AVG |
|---|---|---|---|
| Text | 5.28 | 6.16 | 5.72 |
| Image | 8.47 | 24.51 | 16.49 |
| Text + Image | 9.60 | 26.13 | 17.86 |
| Text × Image | 16.42 | 29.90 | 23.16 |
| FreeDom | 30.95 | 35.52 | 33.24 |
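The Text + Image and Text × Image baselines in the tables fuse the two unimodal similarity sets additively or multiplicatively. A minimal sketch of both fusions, again with random embeddings as stand-ins for VLM features; shifting the cosine similarities to be non-negative before multiplying is our assumption, made so that two negative scores cannot produce a spuriously high product.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2n(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

D = 64
database = l2n(rng.normal(size=(500, D)))  # database image embeddings
q_img = l2n(rng.normal(size=D))            # image-query embedding
q_txt = l2n(rng.normal(size=D))            # text-query (domain) embedding

s_img = database @ q_img  # "Image" baseline: image-to-image similarity
s_txt = database @ q_txt  # "Text" baseline: text-to-image similarity

s_sum  = s_txt + s_img                  # "Text + Image": additive fusion
s_prod = (s_txt + 1.0) * (s_img + 1.0)  # "Text x Image": multiplicative fusion
                                        # over cosines shifted into [0, 2]

top_sum  = np.argsort(-s_sum)[:10]   # top-10 under additive fusion
top_prod = np.argsort(-s_prod)[:10]  # top-10 under multiplicative fusion
```

As the tables show, both fusions fall well short of FreeDom, since neither resolves which attribute of the image query (class) and text query (domain) the retrieved images should satisfy.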