Text Encoding Initiative Guidelines

TEI

A comprehensive XML-based framework for encoding machine-readable texts, primarily in the humanities, social sciences, and linguistics. The TEI Guidelines define approximately 500 elements and attributes for representing the structural, visual, and semantic features of texts including prose, verse, drama, spoken language, manuscripts, dictionaries, and linguistic corpora. Originating at the 1987 Vassar College conference, the standard has evolved through P3 (1994), P4 (2002, SGML to XML), and P5 (2007, modular ODD architecture), with maintenance releases at least twice yearly. The TEI Consortium received the Antonio Zampolli Prize in 2017.

Overview

The Text Encoding Initiative (TEI) is one of the longest-running and most widely adopted standards in the digital humanities. Continuously active since the 1980s, the TEI defines a comprehensive XML vocabulary of approximately 500 elements and attributes for representing texts of every kind, from medieval manuscripts and linguistic corpora to born-digital documents and critical editions. The TEI Consortium maintains the Guidelines, a journal, a wiki, a GitHub repository, and a complete toolchain, and was awarded the Antonio Zampolli Prize by the Alliance of Digital Humanities Organizations in 2017.

Background

The TEI originated at a 1987 planning conference at Vassar College, convened by scholars from the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The resulting "Poughkeepsie Principles" directed development of the first standard. The first major release, TEI P3, appeared in 1994, co-edited by Lou Burnard (Oxford University) and Michael Sperberg-McQueen (University of Illinois at Chicago, later W3C). TEI P3 was updated in 1999.

TEI P4 (2002) made the critical transition from SGML to XML and, with it, adopted Unicode, which all XML parsers are required to support. TEI P5 (2007) integrated the W3C xml:lang and xml:id attributes, regularized pointing attributes to use the hash convention (a local reference is written as # followed by the xml:id of the target element), unified the ptr and xptr tags, and established a fully modular architecture based on ODD (One Document Does it all). TEI P5 version 2.0.1 (2011) added support for genetic editing. Maintenance and feature update versions of TEI P5 have been released at least twice a year since 2007. Alongside this technical work, the TEI Consortium was incorporated as a non-profit membership organization in 2000.
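The P5 pointing convention can be illustrated with a short sketch. The fragment below is invented for illustration, and resolve_local_pointers is a hypothetical helper name, not part of any TEI tool. It uses only Python's standard library and handles only local references of the form #id, not full URIs pointing into other documents.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Expanded name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample fragment, for illustration only.
SAMPLE = f"""
<text xmlns="{TEI_NS}">
  <body>
    <p xml:id="p1">First paragraph.</p>
    <p>See <ptr target="#p1"/> for context.</p>
  </body>
</text>
"""


def resolve_local_pointers(root):
    """Map each local '#id' pointer to the element carrying that xml:id."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    resolved = {}
    for ptr in root.iter(f"{{{TEI_NS}}}ptr"):
        # A target attribute may hold several whitespace-separated references.
        for ref in ptr.get("target", "").split():
            if ref.startswith("#"):
                resolved[ref] = by_id.get(ref[1:])  # None if nothing matches
    return resolved


root = ET.fromstring(SAMPLE)
targets = resolve_local_pointers(root)
print(targets["#p1"].text)  # prints "First paragraph."
```

Building the xml:id index once and then looking up each reference keeps the resolution linear in the size of the document; a dangling pointer simply maps to None rather than raising an error.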

Purpose & Scope

The TEI Guidelines define an XML vocabulary that is primarily semantic rather than presentational: they specify the meaning and intended interpretation of every element and attribute, illustrate each with examples, and ground it in the practice of one or more academic disciplines. The approximately 500 textual components and concepts covered include word, sentence, character, glyph, and person. Modules cover:

  • Core and header — bibliographic description, text structure, paragraphs, lists, notes
  • Verse and drama — line groups, stage directions, speaker labels
  • Manuscript description — physical description, hands, scribal interventions, bindings
  • Critical apparatus — variant readings, witnesses, stemmatic relationships
  • Dictionaries — entries, senses, etymologies, usage examples
  • Linguistic annotation — tokenization, morphological features, syntactic structures
  • Names, dates, and places — named entity markup with normalization
  • Transcription of primary sources — additions, deletions, damage, editorial interventions
  • Figures and tables — graphics, formulae, tabular data
  • Non-hierarchical structures — options for representing overlapping markup
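As a rough illustration of what semantic markup with normalization looks like in practice, the sketch below reads a small paragraph tagged with persName, placeName, and date. The sentence, the identifiers #jd and #york, and the helper name named_entities are all invented for this example; only the element names and the ref and when attributes come from the TEI vocabulary.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample paragraph, for illustration only.
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  <persName ref="#jd">Jane Doe</persName> arrived in
  <placeName ref="#york">York</placeName> on
  <date when="1850-05-01">the first of May</date>.
</p>
"""


def named_entities(root):
    """Return (element, surface text, normalized value) for each entity."""
    rows = []
    for tag, attr in (("persName", "ref"), ("placeName", "ref"), ("date", "when")):
        for el in root.iterfind(f".//tei:{tag}", TEI):
            # Collapse internal whitespace in the transcribed surface form.
            surface = " ".join("".join(el.itertext()).split())
            rows.append((tag, surface, el.get(attr)))
    return rows


for row in named_entities(ET.fromstring(SAMPLE)):
    print(row)
```

The transcription keeps the wording of the source ("the first of May") while the when attribute carries a machine-comparable date, which is the separation between text and interpretation that the modules above are designed to support.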

The Guidelines do not prescribe a single schema. Instead, each project creates a customization, written as an ODD file either by hand or with the Roma web application, that selects the modules and elements relevant to its material. TEI Lite is a well-known example of such a customization.

ODD (One Document Does it all)

ODD is a literate programming language for XML schemas that combines human-readable documentation and machine-readable models using TEI Documentation Elements. From a single ODD source, tools generate two kinds of output: documentation, localized and internationalized, in HTML, EPUB, or PDF; and schemas in DTD, W3C XML Schema, RELAX NG Compact Syntax, or RELAX NG XML Syntax. The Roma web application is built around ODD and can generate schemas in all these formats. Although ODD files generally describe customized subsets of the full TEI model, ODD can also describe XML formats entirely separate from the TEI, such as the W3C's Internationalization Tag Set.
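To give a sense of what an ODD customization contains, the sketch below embeds a small schemaSpec and lists the modules it selects. The identifier tei_example, the particular selection of elements, and the helper summarize_odd are assumptions made for illustration; the fragment has not been run through Roma or the TEI Stylesheets, so it should be read as a shape, not a tested schema.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical customization: four modules, with "core" cut down to a few
# elements through the include attribute.
ODD = """
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="tei_example" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core" include="p list item note title"/>
  <moduleRef key="textstructure"/>
</schemaSpec>
"""


def summarize_odd(schema_spec):
    """Map each selected module to its included elements (None means all)."""
    summary = {}
    for ref in schema_spec.iterfind("tei:moduleRef", TEI):
        include = ref.get("include")
        summary[ref.get("key")] = include.split() if include else None
    return summary


summary = summarize_odd(ET.fromstring(ODD))
for module, elements in summary.items():
    print(module, "->", elements if elements else "all elements")
```

In a real project the schemaSpec sits inside a full TEI document whose prose explains each choice, which is what makes the format literate: the same file yields both the schema and its documentation.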

Known Customizations

  • EpiDoc — for epigraphy and papyrology
  • Charters Encoding Initiative (CEI) — for medieval charters
  • Medieval Nordic Text Archive (Menota) — for medieval Nordic texts

Serializations & Technical Formats

TEI documents are encoded in XML conforming to schemas generated from the Guidelines. The canonical schema format is RELAX NG, though W3C XML Schema and DTD versions are also produced. The TEI namespace URI is http://www.tei-c.org/ns/1.0. Transformation between TEI XML and other formats (HTML, PDF, EPUB, plain text) is supported by the TEI Stylesheets package and the TEIGarage conversion service.
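Because every TEI P5 element lives in the namespace given above, any generic XML tool has to be told about that namespace before it can find anything. The following sketch parses a minimal invented document and flattens it to plain text. It is a toy stand-in written with Python's standard library, not the TEI Stylesheets or TEIGarage, and to_plain_text is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Minimal invented document: a header with a title, and a two-paragraph body.
DOC = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Document</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Hello, <hi rend="italic">TEI</hi>.</p>
      <p>Second   paragraph.</p>
    </body>
  </text>
</TEI>
"""


def to_plain_text(root):
    """Return the title and body paragraphs separated by blank lines."""
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI)
    body = root.find("tei:text/tei:body", TEI)
    paragraphs = []
    if body is not None:
        for p in body.iterfind(".//tei:p", TEI):
            # itertext() includes text inside child elements such as <hi>.
            paragraphs.append(" ".join("".join(p.itertext()).split()))
    return "\n\n".join([title.strip()] + paragraphs)


plain = to_plain_text(ET.fromstring(DOC))
print(plain)
```

Searching without the namespace mapping (for example root.find("text/body")) returns nothing, which is a common first stumbling block when processing TEI with general-purpose XML libraries.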

Governance & Maintenance

The TEI Consortium is governed by a Board of Directors and a Technical Council. The Technical Council manages the Guidelines through a public GitHub repository, with releases issued at least twice yearly since 2007. Community input flows through Special Interest Groups (SIGs), working groups, the TEI-L mailing list, and annual conferences and members' meetings. The Guidelines are dual-licensed under the Creative Commons Attribution 3.0 Unported (CC BY 3.0) license and the BSD 2-Clause license, which permit open access and reuse.

Notable Projects

TEI is used by projects worldwide, most of them based at or associated with universities:

Project                      Subject
British National Corpus      100-million-word snapshot of English usage
Oxford Text Archive          1GB+ of linguistic data in 25 languages
Perseus Project              Greek and Latin texts
EpiDoc                       Epigraphy and papyrology
Women Writers Project        Early modern women writers
FreeDict                     Bilingual dictionaries
Text Creation Partnership    Early British and American books
CELT                         Ancient and medieval Irish manuscripts
ISTEX                        Archives of scientific publications

Related Standards

  • Dublin Core — general-purpose descriptive metadata; TEI headers often incorporate Dublin Core elements
  • EAD (Encoded Archival Description) — archival finding aid standard with complementary scope
  • MEI (Music Encoding Initiative) — music encoding standard that uses TEI-inspired approaches

Further Reading