Tomáš Skopal
Prague Computer Science Seminar: Similarity Search in Unstructured Data
On 2017-02-23 16:00
In the "Big Data" era, we often encounter data that come from sensors
digitizing the "signals of nature", where the technical data structure serves
merely for manipulation and reproduction. We often think of multimedia (image,
audio) as the prominent non-structured data types; however, general sensory data
are much more diverse. Abstract similarity models are used for searching
non-structured data: the data entities are represented by domain-specific
descriptors (e.g., high-dimensional vectors, time series, or strings). The
similarity of two entities is then measured as the distance between their
descriptors, so the problem is geometrized as searching for the descriptors
nearest to the descriptor of the query object.
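
The nearest-descriptor formulation can be illustrated with a minimal sketch.
It is not taken from the talk: the function names (euclidean, knn) and the toy
vectors are assumptions chosen for illustration, and the search is a plain
sequential scan with no index.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(
    query: T,
    descriptors: Sequence[T],
    distance: Callable[[T, T], float],
    k: int = 1,
) -> List[Tuple[float, int]]:
    """Return the k (distance, index) pairs closest to the query.

    Sequential scan: every descriptor is compared with the query once.
    The distance function is a parameter, so the same search works for
    vectors, time series, or strings.
    """
    scored = ((distance(query, d), i) for i, d in enumerate(descriptors))
    return heapq.nsmallest(k, scored)


# Hypothetical 2-D descriptors, for illustration only.
descriptors = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
nearest = knn((0.9, 1.2), descriptors, euclidean, k=2)
```

Passing the distance as an argument keeps the search independent of the
descriptor type, which mirrors the separation between the descriptor and the
similarity function described above.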

The geometry of similarity spaces is very important for database indexing, i.e.,
techniques for speeding up the search, but also for modeling the similarity and
the descriptor itself. In the talk we will show that the implicit Euclidean
perception of space is not the only possibility; the more general metric space
model is also very popular. One could even develop unique distance spaces whose
topological properties are directly derived from the data. We will also discuss
problems related to similarity modeling, especially the choice between semantic
descriptors and smart similarity functions.
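
As a hedged illustration of why the metric space model matters for indexing,
the sketch below uses the triangle inequality to skip distance computations in
a range query. It is an assumption-laden toy, not the method presented in the
talk: the class name PivotIndex, the single-pivot design, and the word list are
hypothetical, and the edit (Levenshtein) distance stands in for any
non-Euclidean metric.

```python
from typing import Callable, Generic, List, Sequence, TypeVar

T = TypeVar("T")


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (a metric on strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class PivotIndex(Generic[T]):
    """Single-pivot filter: distances to one pivot are precomputed."""

    def __init__(self, objects: Sequence[T],
                 distance: Callable[[T, T], float], pivot: T) -> None:
        self.objects = list(objects)
        self.distance = distance
        self.pivot = pivot
        self.to_pivot = [distance(o, pivot) for o in self.objects]
        self.computed = 0  # distance evaluations made by the last query

    def range_query(self, query: T, radius: float) -> List[T]:
        """Return all objects within `radius` of `query`."""
        d_qp = self.distance(query, self.pivot)
        self.computed = 1
        hits = []
        for obj, d_op in zip(self.objects, self.to_pivot):
            # Triangle inequality: |d(q,p) - d(o,p)| <= d(q,o).
            # If the lower bound exceeds the radius, obj cannot qualify.
            if abs(d_qp - d_op) > radius:
                continue
            self.computed += 1
            if self.distance(query, obj) <= radius:
                hits.append(obj)
        return hits


# Hypothetical data, for illustration only.
words = ["cat", "cart", "car", "elephant", "telephone"]
index = PivotIndex(words, levenshtein, pivot="cat")
hits = index.range_query("bat", radius=1)
```

The filter is only valid when the distance satisfies the metric axioms; a
similarity function that violates the triangle inequality would make the
pruning step discard correct answers.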