TextMD: Technical Metadata for Text

TextMD

By LC

An XML schema for describing the technical characteristics of text-based digital objects. TextMD captures properties such as character encoding, markup language, markup basis, font information, page ordering, and line layout. Maintained by the Library of Congress, it is commonly used as a METS extension schema within digital preservation workflows, where it complements content-level descriptive metadata.

Overview

TextMD is an XML schema maintained by the Library of Congress for recording the technical characteristics of text-based digital objects. It captures the encoding, markup, and typographic information needed to preserve and render digital text over the long term. Like AudioMD and VideoMD, TextMD is designed to serve as a METS extension schema, providing the domain-specific technical detail that METS itself does not prescribe.

Background

TextMD originated in the mid-2000s with the New York University Digital Library Team, and its maintenance later passed to the Library of Congress. It was created to fill a gap in digital preservation metadata for text-based content. While schemas like MODS and Dublin Core describe what a text resource is about, and PREMIS captures preservation actions and events, TextMD focuses on the technical characteristics needed to correctly render and preserve the text itself. These include character encoding, markup language, font information, and page layout, all of which become critical when content is migrated across systems or formats over time.

Purpose & Scope

TextMD captures the technical properties of text-based digital objects, including both born-digital text and digitized text derived from scanning. It is intended for use within digital preservation systems and institutional repositories. The schema addresses the encoding characteristics, markup basis, and presentational properties of text content.

Key Elements / Properties

Element            Description
encoding           Character encoding scheme (e.g., UTF-8, ASCII, ISO-8859-1)
markup_basis       Metalanguage on which the markup is based (e.g., SGML, XML)
markup_language    Specific markup language, schema, or DTD applied (e.g., HTML, TEI)
processingNote     Notes on processing applied to the text
pageOrder          Reading order of pages
lineLayout         Arrangement of lines on a page
font               Font information for rendered text
textNote           General notes about the text object
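
To make the table concrete, here is a minimal Python sketch that reads a small TextMD-style fragment and collects its properties into a dictionary. It is an illustration under stated assumptions, not a validated instance: the namespace URI is assumed to be the TextMD version 3 identifier, the flat one-element-per-property layout is a simplification (the published schema may nest some of these elements or express values as attributes), and the sample values, including the pageOrder value, are made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for TextMD version 3; check it against the published schema.
TEXTMD_NS = "info:lc/xmlns/textMD-v3"

# Illustrative fragment only. Element names come from the table above;
# the flat layout and all values are hypothetical.
SAMPLE = f"""
<textMD xmlns="{TEXTMD_NS}">
  <encoding>UTF-8</encoding>
  <markup_basis>XML</markup_basis>
  <markup_language>TEI P5</markup_language>
  <pageOrder>left-to-right</pageOrder>
  <textNote>Transcribed from page scans.</textNote>
</textMD>
"""


def read_text_properties(xml_text: str) -> dict:
    """Map each child element's local name to its trimmed text content."""
    root = ET.fromstring(xml_text)
    props = {}
    for child in root:
        # ElementTree reports tags as "{namespace}name"; keep only the name.
        local_name = child.tag.split("}", 1)[-1]
        props[local_name] = (child.text or "").strip()
    return props


properties = read_text_properties(SAMPLE)
print(properties["encoding"])
print(properties["markup_language"])
```

A preservation system could use a lookup like this to decide how to decode and render a stored text file before migrating it to another format.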

Serializations & Technical Formats

TextMD is defined as an XML Schema (XSD). Instances are XML documents, typically embedded within METS <techMD> sections. The schema is available for download from the Library of Congress website.
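
The embedding described above can be sketched with Python's standard library. The example wraps a TextMD record in a METS techMD section using the usual mdWrap and xmlData containers. Several details are assumptions for illustration: the TextMD namespace URI, the ID value "TMD1", the helper name build_tech_md, and the flat property elements are not taken from the source, and in a full METS document the techMD element would itself sit inside an amdSec.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
# Assumed namespace URI for TextMD version 3; check it against the published schema.
TEXTMD_NS = "info:lc/xmlns/textMD-v3"

# Use readable prefixes instead of auto-generated ns0/ns1 when serializing.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("textmd", TEXTMD_NS)


def build_tech_md(section_id: str, properties: dict) -> ET.Element:
    """Wrap TextMD properties in a METS techMD section (hypothetical helper)."""
    tech_md = ET.Element(f"{{{METS_NS}}}techMD", {"ID": section_id})
    # MDTYPE tells METS consumers which extension schema the wrapper holds.
    md_wrap = ET.SubElement(tech_md, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "TEXTMD"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    text_md = ET.SubElement(xml_data, f"{{{TEXTMD_NS}}}textMD")
    for name, value in properties.items():
        # Simplification: one flat child element per property.
        ET.SubElement(text_md, f"{{{TEXTMD_NS}}}{name}").text = value
    return tech_md


tech_md = build_tech_md("TMD1", {"encoding": "UTF-8", "markup_basis": "XML"})
print(ET.tostring(tech_md, encoding="unicode"))
```

Output produced this way should still be validated against the METS and TextMD schemas before it is ingested into a repository.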

Governance & Maintenance

TextMD is maintained by the Library of Congress. Updates are published on the LC standards website. The schema has been relatively stable since its initial versions, with minor revisions to accommodate evolving text formats and encoding practices.

Notable Implementations

TextMD is used within the Library of Congress's digital preservation workflows and by institutions that manage large-scale text digitization projects. It appears in METS profiles created for digital newspapers, digitized books, and born-digital text collections. Digital preservation frameworks that support METS extension schemas can accommodate TextMD.

Related Standards

  • METS -- The Metadata Encoding and Transmission Standard, which serves as the primary container for TextMD metadata.
  • PREMIS -- The preservation metadata standard, addressing broader preservation concerns alongside TextMD's technical focus.
  • ALTO -- Analyzed Layout and Text Object, which captures OCR results and page layout at a more granular level.
  • MODS -- Metadata Object Description Schema, which handles descriptive (rather than technical) metadata for text resources.
