Ahmet Iscen presents Retrieval-Augmented Media Understanding
On 2023-05-30 11:00
at G205, Karlovo náměstí 13, Praha 2
Retrieval-augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP. The goal is to enhance the recognition capabilities of the model by retrieving examples similar to the visual input from an external memory. We first propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g. image-text pairs, question-answering pairs, knowledge-graph triplets) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty of our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to yield significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.
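
To make the retrieve-then-fuse pipeline concrete, here is a minimal, self-contained Python sketch of a generic retrieval-augmented generation step. All of it (the random NumPy "embeddings", dot-product retrieval, weighted fusion) is an illustrative assumption for exposition, not the actual REVEAL implementation.

    # Illustrative sketch of a retrieval-augmented generation step (assumed form,
    # not the REVEAL code). Memory entries are assumed to have been pre-encoded
    # into key vectors and value embeddings by a unified encoder.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 64             # embedding dimension (assumed)
    memory_size = 1000
    top_k = 5

    memory_keys = rng.standard_normal((memory_size, d))    # one key per knowledge entry
    memory_values = rng.standard_normal((memory_size, d))  # encoded entry contents

    def retrieve(query_embedding, k=top_k):
        """Return the k memory entries most similar to the query (dot-product scores)."""
        scores = memory_keys @ query_embedding
        top = np.argsort(-scores)[:k]
        return memory_values[top], scores[top]

    def fuse_and_generate(query_embedding, retrieved, scores):
        """Toy 'generator': fuse the query with retrieved knowledge by softmax-weighted
        averaging. A real generator would attend over the entries inside a transformer."""
        w = np.exp(scores - scores.max())
        w /= w.sum()
        fused = query_embedding + (w[:, None] * retrieved).sum(axis=0)
        return fused  # would be decoded into an answer or caption downstream

    query = rng.standard_normal(d)          # stands in for an encoded image + question
    values, scores = retrieve(query)
    output_repr = fuse_and_generate(query, values, scores)
    print(output_repr.shape)

Because retrieval, fusion and generation are all differentiable here, the same end-to-end pre-training idea described above can, in principle, update the memory encoder and the retriever jointly with the generator.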
Second, we show the benefit of retrieval-augmented models for classification tasks, where we introduce an attention-based memory module that learns the importance of each example retrieved from the memory. We evaluate our method on three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT and WebVision datasets.
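
A minimal sketch of what such an attention-weighted aggregation of retrieved examples could look like for classification follows; the simple dot-product attention and all names are assumptions for illustration, not the actual module from the talk.

    # Illustrative attention over retrieved neighbours for classification (assumed
    # form). Each retrieved example carries a feature vector and a class label;
    # the attention weights decide how much each example contributes.
    import numpy as np

    rng = np.random.default_rng(1)
    d, k, num_classes = 64, 10, 5

    query_feat = rng.standard_normal(d)                 # test image feature
    neighbour_feats = rng.standard_normal((k, d))       # retrieved memory features
    neighbour_labels = rng.integers(0, num_classes, k)  # their labels
    W_q = rng.standard_normal((d, d)) * 0.01            # learned projections (random here)
    W_k = rng.standard_normal((d, d)) * 0.01

    def attention_weights(q, neighbours):
        """Softmax scores expressing how important each retrieved example is."""
        scores = (neighbours @ W_k) @ (W_q @ q) / np.sqrt(d)
        scores -= scores.max()
        w = np.exp(scores)
        return w / w.sum()

    def classify(q, neighbours, labels):
        """Accumulate attention mass per class and predict the heaviest class."""
        w = attention_weights(q, neighbours)
        class_scores = np.zeros(num_classes)
        for weight, label in zip(w, labels):
            class_scores[label] += weight
        return int(class_scores.argmax()), class_scores

    pred, class_scores = classify(query_feat, neighbour_feats, neighbour_labels)
    print("predicted class:", pred)

Learning the attention weights, rather than treating all retrieved neighbours equally, is what lets the module down-weight noisy or irrelevant memory entries in the long-tailed and noisy-label settings mentioned above.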