
Analyzed Layout and Text Object

ALTO

By LC

An open XML schema for describing the layout and textual content of pages in digitized documents, typically generated by optical character recognition (OCR) software. ALTO encodes the position, size, and style of text blocks, text lines, individual words (strings), and non-textual elements such as illustrations and margins. Originally developed by the EU-funded METAe project, ALTO has been maintained by the Library of Congress since 2009 and is commonly used alongside METS for describing complete digitized objects.

Overview

ALTO (Analyzed Layout and Text Object) is an XML schema for encoding the layout and textual content of digitized document pages, most commonly generated as output from optical character recognition (OCR) software. It captures the position, size, and typographic style of the text elements on a page, along with structural features such as margins, columns, headings, and illustrations. ALTO is widely used in digital library and cultural heritage digitization workflows.

Background

ALTO was originally developed as part of the EU-funded METAe (META-data engine) project, which ran from 2000 to 2003 and aimed to automate the encoding of digitized texts. Version 1.0 of the schema was released in June 2004, with subsequent versions (through 1.4) maintained by Content Conversion Specialists (CCS) GmbH in Hamburg, Germany. In August 2009, stewardship of the standard was transferred to the Library of Congress, which established a dedicated editorial board to oversee its continued development. The standard has since progressed to the 4.x series, adding support for reading order, glyph-level encoding, and enhanced typographic features.

Purpose & Scope

An ALTO file typically represents a single page of a digitized document, although the schema allows more than one <Page> per file. It consists of three major sections within the root <alto> element:

  • Description -- metadata about the ALTO file itself, including measurement units, source image information, and processing history (OCR engine, parameters, confidence)
  • Styles -- definitions of text styles (font family, size, color) and paragraph styles (alignment, spacing) referenced by content elements
  • Layout -- the actual page content, organized into <Page> elements containing spatial regions: <TopMargin>, <LeftMargin>, <RightMargin>, <BottomMargin>, and <PrintSpace>
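The three sections can be sketched with Python's standard library. This is a minimal illustration, not a schema-valid file: the helper name build_skeleton, the page dimensions, and the ID value are assumptions chosen for the example, and the namespace shown is the one used by the 4.x series.

```python
import xml.etree.ElementTree as ET

# Namespace of the ALTO 4.x series; older versions use a different URI.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def build_skeleton(width: int, height: int) -> ET.Element:
    """Return a bare <alto> tree with Description, Styles, and Layout.

    Hypothetical helper for illustration; the result has not been
    validated against the ALTO XSD.
    """
    ET.register_namespace("", ALTO_NS)  # serialize with a default namespace

    def q(tag: str) -> str:
        return f"{{{ALTO_NS}}}{tag}"

    root = ET.Element(q("alto"))

    # Description: metadata about the ALTO file itself.
    description = ET.SubElement(root, q("Description"))
    ET.SubElement(description, q("MeasurementUnit")).text = "pixel"

    # Styles: left empty here; text and paragraph styles would go inside.
    ET.SubElement(root, q("Styles"))

    # Layout: one Page holding a PrintSpace that spans the whole page.
    layout = ET.SubElement(root, q("Layout"))
    page = ET.SubElement(
        layout, q("Page"), ID="P1", WIDTH=str(width), HEIGHT=str(height)
    )
    ET.SubElement(
        page, q("PrintSpace"),
        HPOS="0", VPOS="0", WIDTH=str(width), HEIGHT=str(height),
    )
    return root


root = build_skeleton(2480, 3508)
sections = [child.tag.split("}")[1] for child in root]
print(ET.tostring(root, encoding="unicode"))
```

Margins are omitted for brevity; in a fuller file they would sit next to <PrintSpace> inside <Page>.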

Within the print space, content is further organized into <TextBlock>, <TextLine>, and <String> elements, each carrying coordinate attributes (HPOS, VPOS, WIDTH, HEIGHT) that precisely locate the element within the page image.
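As a rough sketch of how these coordinate attributes can be read, the following Python snippet collects each <String> with its bounding box and word confidence. The sample document and the function name words_with_boxes are made up for illustration; the namespace is read from the root element because it differs between ALTO versions.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration, not real OCR output.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="P1" WIDTH="1000" HEIGHT="1400">
      <PrintSpace HPOS="50" VPOS="60" WIDTH="900" HEIGHT="1280">
        <TextBlock ID="TB1" HPOS="50" VPOS="60" WIDTH="400" HEIGHT="40">
          <TextLine ID="TL1" HPOS="50" VPOS="60" WIDTH="400" HEIGHT="40">
            <String CONTENT="Hello" HPOS="50" VPOS="60" WIDTH="120" HEIGHT="40" WC="0.98"/>
            <SP HPOS="170" VPOS="60" WIDTH="20"/>
            <String CONTENT="world" HPOS="190" VPOS="60" WIDTH="130" HEIGHT="40" WC="0.91"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def words_with_boxes(xml_text):
    """Return (content, (hpos, vpos, width, height), confidence) per String."""
    root = ET.fromstring(xml_text)
    # Reuse whatever namespace the root element declares, if any.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    words = []
    for s in root.iter(f"{prefix}String"):
        box = tuple(float(s.get(k, 0)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        wc = s.get("WC")  # word confidence, optional
        words.append((s.get("CONTENT", ""), box, float(wc) if wc is not None else None))
    return words


for content, box, wc in words_with_boxes(SAMPLE):
    print(content, box, wc)
```

The coordinates are expressed in whatever measurement unit the file's Description section declares, so a consumer that overlays boxes on the page image may need to convert them first.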

Key Elements

Element              Description
<Page>               A single page with dimensions and physical structure
<PrintSpace>         The area containing printed content
<TextBlock>          A block of text (paragraph, column, etc.)
<TextLine>           A single line of text within a block
<String>             A word or token with coordinates and OCR confidence
<Illustration>       A non-textual graphical region
<GraphicalElement>   Lines, borders, and decorative elements
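To show how the text elements nest, here is a small Python sketch that rebuilds plain text from a page by walking each <TextLine> and joining its <String> contents, treating <SP> elements as spaces. The sample and the names local and page_text are assumptions for the example; hyphenation and reading order are ignored.

```python
import xml.etree.ElementTree as ET

# Hand-written two-line sample for illustration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="P1" WIDTH="1000" HEIGHT="1400">
      <PrintSpace HPOS="0" VPOS="0" WIDTH="1000" HEIGHT="1400">
        <TextBlock ID="TB1" HPOS="50" VPOS="60" WIDTH="400" HEIGHT="90">
          <TextLine ID="TL1" HPOS="50" VPOS="60" WIDTH="400" HEIGHT="40">
            <String CONTENT="Hello" HPOS="50" VPOS="60" WIDTH="120" HEIGHT="40"/>
            <SP HPOS="170" VPOS="60" WIDTH="20"/>
            <String CONTENT="world" HPOS="190" VPOS="60" WIDTH="130" HEIGHT="40"/>
          </TextLine>
          <TextLine ID="TL2" HPOS="50" VPOS="110" WIDTH="400" HEIGHT="40">
            <String CONTENT="again" HPOS="50" VPOS="110" WIDTH="120" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def page_text(xml_text: str) -> str:
    """Join String contents per TextLine, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local(element.tag) != "TextLine":
            continue
        parts = []
        for child in element:
            name = local(child.tag)
            if name == "String":
                parts.append(child.get("CONTENT", ""))
            elif name == "SP":  # explicit inter-word space
                parts.append(" ")
        lines.append("".join(parts))
    return "\n".join(lines)


print(page_text(SAMPLE))
```

Matching on local names keeps the sketch independent of the ALTO version's namespace, at the cost of not checking that the file is ALTO at all.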

Serializations & Technical Formats

ALTO is defined as an XML Schema (XSD). Instance files use the .xml extension and declare a version-specific ALTO namespace on the root element (for example, http://www.loc.gov/standards/alto/ns-v4# for the 4.x series). The schema is published by the Library of Congress and versioned alongside the standard.
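Because the namespace differs between major versions, a consumer may want to check it before processing a file. The sketch below assumes the Library of Congress namespace pattern http://www.loc.gov/standards/alto/ns-vN# and returns None for anything else; the function name alto_major_version is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern for LoC-hosted ALTO namespaces (v2 and later).
_ALTO_ROOT = re.compile(r"\{http://www\.loc\.gov/standards/alto/ns-v(\d+)#\}alto$")


def alto_major_version(xml_text: str):
    """Return the major version implied by the root namespace, or None."""
    root = ET.fromstring(xml_text)
    match = _ALTO_ROOT.match(root.tag)
    return int(match.group(1)) if match else None


v4_doc = '<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#"/>'
v2_doc = '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"/>'
print(alto_major_version(v4_doc), alto_major_version(v2_doc))
```

This only inspects the namespace; validating a file against the XSD itself would need a schema-aware library, which is outside the scope of this sketch.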

Governance & Maintenance

Since 2009, ALTO has been maintained by the Library of Congress with guidance from an editorial board comprising representatives from major digital library initiatives and OCR software vendors. Development and schema files are hosted on GitHub under the altoxml organization. Changes go through the editorial board review process.

Notable Implementations

  • ABBYY FineReader -- commercial OCR software that can produce ALTO output
  • Tesseract -- the open-source OCR engine supports ALTO export
  • Transkribus -- handwritten text recognition platform produces ALTO
  • eScriptorium -- open-source HTR/OCR platform with ALTO support
  • Kitodo -- digital library workflow management software uses ALTO
  • Europeana Newspapers -- large-scale digitized newspaper project using METS/ALTO
  • HathiTrust -- the digital library preserves page-level ALTO alongside METS structural metadata

Related Standards

ALTO is most commonly used in combination with METS (Metadata Encoding and Transmission Standard), which provides the structural map linking ALTO page files into a complete digitized object. It is related to hOCR (an HTML-based OCR output format) and PAGE XML (an alternative page analysis and ground-truth format). PREMIS provides preservation metadata that may accompany ALTO files in digital preservation workflows.
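The link between METS and ALTO can be illustrated by listing the ALTO file references in a METS file section. This is a sketch under assumptions: the sample METS document is invented, and grouping ALTO files under a fileGrp with USE="FULLTEXT" is a convention found in some profiles rather than a METS requirement, so the value is a parameter.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Invented sample; file names and IDs are placeholders.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGE">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="ALTO_0001" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="alto/0001.xml"/>
      </mets:file>
      <mets:file ID="ALTO_0002" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="alto/0002.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def alto_hrefs(mets_text: str, use: str = "FULLTEXT"):
    """Return the xlink:href of every file in fileGrps with the given USE."""
    root = ET.fromstring(mets_text)
    href_attr = f"{{{NS['xlink']}}}href"
    hrefs = []
    for group in root.findall("mets:fileSec/mets:fileGrp", NS):
        if group.get("USE") != use:
            continue
        for flocat in group.findall("mets:file/mets:FLocat", NS):
            href = flocat.get(href_attr)
            if href:
                hrefs.append(href)
    return hrefs


print(alto_hrefs(SAMPLE_METS))
```

In a complete METS document the structural map would then point at these file IDs to place each page in sequence; that step is not shown here.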

Further Reading