The Representation and Analysis of Document Images

                              Mark Burge
                    Department of Systems Science
               Johannes Kepler University Linz, Austria
 

Document image analysis is the process of extracting the primitives in
an image of a document and recreating the document's structure in a
machine-understandable format. It is a highly application-driven field,
and systems are usually designed for a specific collection of documents
from a single domain. This talk presents a framework that supports the
construction of algorithms for the analysis of document images from
different domains. It introduces the Canonical set, a domain-independent
representation for document images in which their high-level primitives,
and the relations between them, are made accessible.  This talk will
explore the representation of spatial relations in the Canonical
set using a new algorithm, the Scaffolding algorithm, for
approximating Voronoi diagrams with generalized attractors (e.g., lines,
curves, areas). Time permitting, examples of using this representation
to develop set-based algorithms which incorporate machine learning to
support the analysis of documents from different domains will be given.
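
To make the idea of a Voronoi diagram with generalized attractors concrete,
the sketch below labels every cell of a small grid with the index of its
nearest line-segment attractor and then reads off which regions touch. It
is a brute-force illustration only and is not the Scaffolding algorithm;
the function names (point_segment_distance, approximate_voronoi,
adjacent_regions) and the grid-based approach are assumptions made for
this example.

```python
import math


def point_segment_distance(px, py, segment):
    """Euclidean distance from point (px, py) to a line segment.

    A segment whose endpoints coincide is treated as a point attractor.
    """
    (ax, ay), (bx, by) = segment
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment's line and clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_voronoi(width, height, attractors):
    """Label each grid cell with the index of its nearest attractor.

    Ties are resolved in favour of the attractor with the lowest index.
    """
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            distances = [point_segment_distance(x, y, s) for s in attractors]
            row.append(distances.index(min(distances)))
        labels.append(row)
    return labels


def adjacent_regions(labels):
    """Return index pairs of attractors whose regions share a cell edge."""
    pairs = set()
    for y, row in enumerate(labels):
        for x, here in enumerate(row):
            if x + 1 < len(row) and row[x + 1] != here:
                pairs.add(tuple(sorted((here, row[x + 1]))))
            if y + 1 < len(labels) and labels[y + 1][x] != here:
                pairs.add(tuple(sorted((here, labels[y + 1][x]))))
    return pairs


if __name__ == "__main__":
    # Two horizontal "text line" attractors on a 10 x 10 grid.
    segments = [((0, 0), (9, 0)), ((0, 9), (9, 9))]
    grid = approximate_voronoi(10, 10, segments)
    for row in grid:
        print("".join(str(label) for label in row))
    print(adjacent_regions(grid))
```

Regions that share a boundary in such a diagram give one simple notion of
spatial neighbourhood between primitives; the cost here grows with the
number of grid cells times the number of attractors, which is why an
approximation scheme is of interest for real document images.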