The Representation and Analysis of Document Images

Mark Burge
Department of Systems Science
Johannes Kepler University
Linz, Austria

Document image analysis is the process of extracting the primitives in an image of a document and recreating the document's structure in a machine-understandable format. It is a very application-driven field, and systems are usually designed for a specific collection of documents from a single domain. This talk presents a framework that supports the construction of algorithms for analyzing document images from different domains. It introduces the Canonical set, a domain-independent representation for document images in which the high-level primitives, and the relations between them, are made accessible. The talk will explore the representation of spatial relations in the Canonical set using a new algorithm, the Scaffolding algorithm, for approximating Voronoi diagrams with generalized attractors (e.g., lines, curves, areas). Time permitting, it will also give examples of using this representation to develop set-based algorithms that incorporate machine learning to support the analysis of documents from different domains.
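
To make the idea of a Voronoi diagram with generalized attractors more concrete, the following is a minimal sketch in Python. It is not the Scaffolding algorithm, whose details are not given in this abstract. Instead it uses a brute-force approach: every cell of a small grid is labeled with the nearest site, where a site may be a point or a line segment, and two sites are treated as spatially related when their regions touch. All names here (`label_grid`, `adjacent_sites`, the `sites` list and its coordinates) are hypothetical and chosen only for illustration.

```python
import math


def dist_point(p, site):
    """Euclidean distance from p to a point site (x, y)."""
    return math.hypot(p[0] - site[0], p[1] - site[1])


def dist_segment(p, seg):
    """Euclidean distance from p to a line-segment site ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return dist_point(p, (x1, y1))
    # Project p onto the segment's line, then clamp to the endpoints.
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return dist_point(p, (x1 + t * dx, y1 + t * dy))


def label_grid(width, height, sites):
    """Label each grid cell with the index of its nearest site.

    sites is a list of (kind, geometry) pairs, with kind either
    "point" or "segment". Ties go to the lower site index.
    """
    metric = {"point": dist_point, "segment": dist_segment}
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            dists = [metric[kind]((x, y), geom) for kind, geom in sites]
            row.append(dists.index(min(dists)))
        grid.append(row)
    return grid


def adjacent_sites(grid):
    """Return the pairs of site indices whose regions share a cell border."""
    pairs = set()
    for y, row in enumerate(grid):
        for x, a in enumerate(row):
            # Compare each cell with its right and lower neighbours.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if ny < len(grid) and nx < len(row):
                    b = grid[ny][nx]
                    if a != b:
                        pairs.add((min(a, b), max(a, b)))
    return pairs


# Hypothetical sites: two points and one vertical line segment.
sites = [
    ("point", (2, 2)),
    ("segment", ((8, 1), (8, 8))),
    ("point", (3, 8)),
]
grid = label_grid(12, 10, sites)
for row in grid:
    print("".join(str(label) for label in row))
print(sorted(adjacent_sites(grid)))
```

The printed grid shows one region per site, and the final line lists which regions border each other. This neighbourhood information is the kind of spatial relation the abstract refers to. The brute-force labeling costs one distance evaluation per cell per site, so it is only practical for small examples.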