TextMD is an XML schema maintained by the Library of Congress for recording the technical characteristics of text-based digital objects. It captures the encoding, markup, and typographic information needed to preserve and render digital text over the long term. Like AudioMD and VideoMD, TextMD is designed as a METS extension schema, supplying the domain-specific technical detail that METS itself does not prescribe.
Background
TextMD dates from the mid-2000s. It was originally created by the New York University Digital Library Team and later passed to the Library of Congress for maintenance. The schema fills a gap in digital preservation metadata for text-based content. While schemas like MODS and Dublin Core describe what a text resource is about, and PREMIS captures preservation actions and events, TextMD focuses on the technical characteristics needed to correctly render and preserve the text itself. These include character encoding, markup language, font information, and page layout, all of which become critical when migrating content across systems or formats over time.
Purpose & Scope
TextMD captures the technical properties of text-based digital objects, including both born-digital text and digitized text derived from scanning. It is intended for use within digital preservation systems and institutional repositories. The schema addresses the encoding characteristics, markup basis, and presentational properties of text content.
Key Elements / Properties
| Element | Description |
|---|---|
| encoding | Character encoding scheme (e.g., UTF-8, ASCII, ISO-8859-1) |
| markup_basis | Markup language used (e.g., XML, SGML, HTML) |
| markup_language | Specific markup schema or DTD applied |
| processingNote | Notes on processing applied to the text |
| pageOrder | Reading order of pages |
| lineLayout | Arrangement of lines on a page |
| font | Font information for rendered text |
| textNote | General notes about the text object |
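To make the table concrete, the following is a minimal sketch of reading these properties from a TextMD fragment with Python's standard library. The namespace URI, the sample values, and the flat layout (each table row as a direct child of the root) are assumptions made for illustration. The published schema may nest or name some elements differently, so check the XSD before relying on this shape.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for TextMD v3; verify against the schema you use.
TEXTMD_NS = "info:lc/xmlns/textMD-v3"

# Illustrative fragment only: flat children named after the table rows,
# with made-up values. Not guaranteed to validate against the real XSD.
SAMPLE = f"""
<textMD xmlns="{TEXTMD_NS}">
  <encoding>UTF-8</encoding>
  <markup_basis>XML</markup_basis>
  <markup_language>TEI P5</markup_language>
  <pageOrder>left-to-right</pageOrder>
  <textNote>Transcribed from page scans.</textNote>
</textMD>
"""


def read_textmd(xml_text: str) -> dict[str, str]:
    """Return a {local element name: text} map for the root's direct children."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for child in root:
        local_name = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        fields[local_name] = (child.text or "").strip()
    return fields


record = read_textmd(SAMPLE)
print(record["encoding"], record["markup_basis"])
```

A flat dictionary is enough for a quick inventory of properties. A production reader would more likely validate the instance against the XSD first and preserve any nesting.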
Serializations & Technical Formats
TextMD is defined as an XML Schema (XSD). Instances are XML documents, typically embedded within METS <techMD> sections. The schema is available for download from the Library of Congress website.
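The embedding described above can be sketched in Python as follows. METS wraps extension metadata in `mdWrap` and `xmlData` elements inside `techMD`. The helper name `build_techmd`, the `TMD1` identifier, the TextMD namespace URI, and the flat child elements are assumptions for illustration, not the schema's definitive structure.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
TEXTMD_NS = "info:lc/xmlns/textMD-v3"  # assumed; check your schema version

# Use readable prefixes instead of the auto-generated ns0/ns1.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("textmd", TEXTMD_NS)


def build_techmd(techmd_id: str, fields: dict[str, str]) -> ET.Element:
    """Wrap a flat set of TextMD properties in a METS techMD section (hypothetical helper)."""
    tech_md = ET.Element(f"{{{METS_NS}}}techMD", {"ID": techmd_id})
    md_wrap = ET.SubElement(tech_md, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "TEXTMD"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    text_md = ET.SubElement(xml_data, f"{{{TEXTMD_NS}}}textMD")
    for name, value in fields.items():
        # Simplification: each property becomes a direct child of textMD.
        ET.SubElement(text_md, f"{{{TEXTMD_NS}}}{name}").text = value
    return tech_md


tech_md = build_techmd("TMD1", {"encoding": "UTF-8", "markup_basis": "XML"})
xml_out = ET.tostring(tech_md, encoding="unicode")
print(xml_out)
```

In a complete METS document this `techMD` element would sit inside an `amdSec`, and files in the `fileSec` would point to it through their `ADMID` attribute.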
Governance & Maintenance
TextMD is maintained by the Library of Congress. Updates are published on the LC standards website. The schema has been relatively stable since its initial versions, with minor revisions to accommodate evolving text formats and encoding practices.
Notable Implementations
TextMD is used within the Library of Congress's digital preservation workflows and by institutions that manage large-scale text digitization projects. It appears in METS profiles created for digital newspapers, digitized books, and born-digital text collections. Digital preservation frameworks that support METS extension schemas can accommodate TextMD.
Related Standards
- METS -- The Metadata Encoding and Transmission Standard, which serves as the primary container for TextMD metadata.
- PREMIS -- The preservation metadata standard, addressing broader preservation concerns alongside TextMD's technical focus.
- ALTO -- Analyzed Layout and Text Object, which captures OCR results and page layout at a more granular level.
- MODS -- Metadata Object Description Schema, which handles descriptive (rather than technical) metadata for text resources.