In computer vision communities such as stereo or visual tracking, commonly
accepted and widely used benchmarks consolidate existing research and
boost scientific progress. However, with their great benefit and impact also
comes great responsibility, as flaws and biases may hamper or even
mislead scientific progress. Characteristics such as non-representative scene
content, inaccurate ground truth, averaged error metrics, or
scalar-based rankings can create unwanted incentives and encourage overfitting.
Most of these drawbacks are negligible when a new,
challenging benchmark is published, but they become more pressing as algorithm
performance improves and approaches the level of accuracy that the given
ground truth and error metrics can reflect. In this talk, we
present lessons learned and best practices derived from analyzing
popular benchmarks of various vision communities. We argue that a well-designed
benchmark should incorporate conscious decisions on scene content,
data acquisition, and error metrics, as well as their mutual influence. We
further present our contributions toward more deliberate
benchmarks that allow specific and comprehensive performance characterizations.
We elaborate on scene content decisions, data acquisition
details, geometry-aware performance measures, and the influence of
application-specific requirements when creating benchmark datasets such as
the HCI Benchmark Suite for stereo and optical flow.
Joint work with Katrin Honauer, Heidelberg Collaboratory for Image Processing
(HCI), Heidelberg University, Germany.