The presented system gradually retrieves more information about the scene and the camera setup. Images contain a huge amount of information (e.g. color pixels). However, much of it is redundant (which explains the success of image compression algorithms). The structure recovery approaches require correspondences between the different images (i.e. image points originating from the same scene point). Due to the combinatorial nature of this problem, it is almost impossible to work on the raw data. The first step therefore consists of extracting features. The features of different images are then compared using similarity measures, and lists of potential matches are established. Based on these, the relations between the views are computed. Since wrong correspondences can be present, robust algorithms are used.
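As an illustration of the comparison step, the following minimal sketch scores feature neighborhoods with normalized cross-correlation and keeps only mutually best pairs as potential matches. The choice of normalized cross-correlation, the mutual-best rule, the threshold of 0.8 and the function names `ncc` and `candidate_matches` are assumptions made for this example; the text above does not specify which similarity measure the system uses.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length intensity vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0.0:
        return 0.0  # a flat patch carries no texture to correlate
    return sum(x * y for x, y in zip(da, db)) / denom


def candidate_matches(patches_1, patches_2, threshold=0.8):
    """Return (i, j, score) for mutually best patch pairs scoring >= threshold.

    patches_1 and patches_2 are lists of flattened intensity patches taken
    around the features of image 1 and image 2 (hypothetical input format).
    """
    if not patches_1 or not patches_2:
        return []
    scores = [[ncc(p, q) for q in patches_2] for p in patches_1]
    matches = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        # Keep the pair only if feature i is also the best partner of j.
        best_i = max(range(len(scores)), key=lambda k: scores[k][j])
        if best_i == i and row[j] >= threshold:
            matches.append((i, j, row[j]))
    return matches
```

The resulting list can still contain wrong correspondences, which is why the subsequent estimation of the relation between the views has to be robust.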
Once consecutive views have been related to each other, the structure of the features and the motion of the camera are computed.
An initial reconstruction is then made for the first two images of the sequence. For the subsequent images the camera pose is estimated in the frame defined by the first two cameras. For every additional image that is processed at this stage, the features corresponding to points in previous images are reconstructed, refined or corrected. Therefore it is not necessary that the initial points stay visible throughout the entire sequence. The result of this step is a reconstruction of typically a few hundred feature points.
When uncalibrated cameras are used, this structure and motion are only determined up to an arbitrary projective transformation. The next step consists of restricting this ambiguity to metric (i.e. Euclidean up to an arbitrary scale factor) through self-calibration.
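The projective ambiguity can be checked numerically: transforming all points by an invertible 4x4 matrix T and all cameras by its inverse leaves every image projection unchanged, so the images alone cannot distinguish the two reconstructions. The sketch below is only an illustration. The camera matrix `P`, the point `X` and the transformation `T` are made-up values, and `T` was chosen so that its inverse can be written down by hand.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def project(P, X):
    """Project a homogeneous 3D point X with a 3x4 camera P to pixel (u, v)."""
    x = [sum(P[i][k] * X[k] for k in range(4)) for i in range(3)]
    return (x[0] / x[2], x[1] / x[2])


# Hypothetical camera and scene point (illustrative values only).
P = [[800, 0, 320, 100],
     [0, 800, 240, -50],
     [0, 0, 1, 2]]
X = [1, 2, 10, 1]

# A projective transformation and its inverse (T * T_inv is the identity).
T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [2, -1, 3, 1]]
T_inv = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [-2, 1, -3, 1]]

# Distorted reconstruction: cameras become P * T_inv, points become T * X.
P_distorted = matmul(P, T_inv)
X_distorted = [sum(T[i][k] * X[k] for k in range(4)) for i in range(4)]

# Both reconstructions explain the same image measurement.
same_projection = project(P, X) == project(P_distorted, X_distorted)
```

Here `same_projection` is true although both the camera and the point have changed, which is exactly the ambiguity that self-calibration has to resolve.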
In a projective reconstruction not only the scene, but also the camera is distorted. Since the algorithm deals with unknown scenes, it has no way of identifying this distortion in the reconstruction. Although the camera is also assumed to be unknown, some constraints on the intrinsic camera parameters (e.g. rectangular or square pixels, constant aspect ratio, principal point in the middle of the image, ...) can often still be assumed. A distortion of the camera mostly results in the violation of one or more of these constraints. A metric reconstruction/calibration is obtained by transforming the projective reconstruction until all the constraints on the cameras' intrinsic parameters are satisfied.
At this point enough information is available to go back to the images and look for correspondences for all the other image points. This search is facilitated since the line of sight corresponding to an image point can be projected to other images, restricting the search range to one dimension. By pre-warping the images (this process is called rectification), standard stereo matching algorithms can be used. This step makes it possible to find correspondences for most of the pixels in the images.
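The one-dimensional search range can be made concrete with the fundamental matrix F of an image pair: a point x in the first image maps to the line F x in the second image, and its correspondence must lie on that line. The sketch below assumes the convention x'^T F x = 0 and uses the fundamental matrix of an ideal rectified pair, for which the line is simply the scanline with the same row. The function names and the sample coordinates are illustrative only.

```python
import math


def epipolar_line(F, x):
    """Return l' = F x as (a, b, c), the line a*u + b*v + c = 0 in image 2."""
    return tuple(sum(F[i][k] * x[k] for k in range(3)) for i in range(3))


def distance_to_line(line, x):
    """Distance in pixels between homogeneous point x and the line (a, b, c)."""
    a, b, c = line
    return abs(a * x[0] + b * x[1] + c * x[2]) / (math.hypot(a, b) * abs(x[2]))


# Fundamental matrix of an ideal rectified pair (pure horizontal shift).
F_RECTIFIED = [[0, 0, 0],
               [0, 0, -1],
               [0, 1, 0]]

# A point on row 85 of the first image yields the line v = 85 in the second.
line = epipolar_line(F_RECTIFIED, (120, 85, 1))
```

A candidate on row 85 of the second image has distance zero to this line, whereas a candidate on row 88 lies three pixels away and can be discarded without comparing any intensities.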
From these correspondences the distance from the points to the camera center can be obtained through triangulation. These results are refined and completed by combining the correspondences from multiple images.
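For a rectified pair, triangulation reduces to a simple relation between depth and disparity (the horizontal shift between corresponding pixels): Z = f B / d, with f the focal length in pixels and B the distance between the two camera centers. The sketch below assumes this idealized rectified setting. The function names are hypothetical, and since the reconstruction is only metric up to scale, the unit of B (and hence of Z) is arbitrary.

```python
def depth_from_disparity(disparity, focal_length_px, baseline):
    """Depth along the optical axis for one pixel of a rectified pair.

    Returns None when the disparity is not positive (no valid match, or a
    point at infinity).
    """
    if disparity <= 0:
        return None
    return focal_length_px * baseline / disparity


def depth_map(disparities, focal_length_px, baseline):
    """Apply depth_from_disparity to a 2D list of disparities."""
    return [[depth_from_disparity(d, focal_length_px, baseline) for d in row]
            for row in disparities]
```

Because depth is inversely proportional to disparity, a fixed matching error of one pixel causes a larger depth error for distant points than for nearby ones, which is one reason to combine correspondences from several images.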
Finally, all results are integrated in a textured 3D surface reconstruction of the scene under consideration. The model is obtained by approximating the depth map with a triangular wire frame. The texture is obtained from the images and mapped onto the surface. An overview of the system is given in Figure 1.7.
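One simple way to approximate a depth map with a triangular wire frame is to place a vertex at every `step`-th pixel and to split each grid cell into two triangles. This is a minimal sketch under that assumption, not necessarily the meshing strategy used by the system. The function name `depth_map_to_mesh` is hypothetical, vertices are kept in image coordinates (column, row, depth), and the back-projection through the calibrated camera is omitted.

```python
def depth_map_to_mesh(depth, step=1):
    """Build a regular triangle mesh over a dense depth map.

    depth is a 2D list of depth values. Returns (vertices, uvs, triangles):
    vertices are (column, row, depth) tuples, uvs are texture coordinates in
    [0, 1] pointing back into the source image, and triangles are index
    triples into the vertex list.
    """
    h, w = len(depth), len(depth[0])
    rows = range(0, h, step)
    cols = range(0, w, step)
    vertices, uvs = [], []
    for r in rows:
        for c in cols:
            vertices.append((c, r, depth[r][c]))
            uvs.append((c / max(w - 1, 1), r / max(h - 1, 1)))
    n_cols = len(cols)
    triangles = []
    for i in range(len(rows) - 1):
        for j in range(n_cols - 1):
            top_left = i * n_cols + j
            top_right = top_left + 1
            bottom_left = top_left + n_cols
            bottom_right = bottom_left + 1
            # Two triangles per grid cell, sharing one diagonal.
            triangles.append((top_left, top_right, bottom_left))
            triangles.append((top_right, bottom_right, bottom_left))
    return vertices, uvs, triangles
```

A larger `step` gives a coarser wire frame with fewer triangles. Since each vertex keeps its texture coordinate, the image can be mapped onto the surface regardless of the mesh resolution.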
Throughout the rest of the text the different steps of the method will be explained in more detail. An image sequence of the Arenberg castle in Leuven will be used for illustration. Some of the images of this sequence can be seen in Figure 1.2. The full sequence consists of 24 images recorded with a video camera.