The Web ARChive (WARC) format is a container file format designed for storing the results of web crawls: HTTP requests, responses, metadata, and related resources bundled into a single archival file. Standardized as ISO 28500, WARC is the backbone of large-scale web archiving operations at institutions like the Internet Archive, national libraries, and research organizations worldwide.

Background

Web archiving emerged as a critical preservation activity in the mid-1990s as it became clear that web content was ephemeral and frequently disappeared without trace. The Internet Archive, founded by Brewster Kahle in 1996, developed the ARC (ARChive) format for its Wayback Machine crawls. While functional, ARC had limitations: it could only store HTTP responses (not requests), lacked support for metadata records, and had no provision for recording revisit information or duplicates. The International Internet Preservation Consortium (IIPC), formed in 2003, led the effort to create a successor format. WARC was published as ISO 28500:2009 and revised as ISO 28500:2017.

Purpose & Scope

WARC files encapsulate complete web interactions for long-term preservation. A single WARC file can contain multiple record types: full HTTP request-response pairs, metadata annotations, conversion records (for format migrations), and continuation records for resources split across files. This comprehensive approach ensures that archived content can be replayed faithfully, preserving not just the page content but the full context of how it was retrieved.

Key Record Types

Record Type	Description
warcinfo	Describes the WARC file itself (creator, date, parameters)
response	Complete HTTP response including headers and payload
request	Complete HTTP request including headers
metadata	Metadata about another record in the WARC file
revisit	Indicates content identical to a previously archived record
resource	A resource stored without full protocol response information (e.g., a file read from local storage or generated locally)
conversion	A transformed version of another record
continuation	Continuation of a record that was too large for one segment
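As a rough illustration of how the response and revisit types relate, the sketch below decides which record type a crawler might write for a newly fetched payload. It assumes a hypothetical in-memory index (`seen`) that maps payload digests to record IDs, and it uses the "sha1:" plus Base32 digest labelling that is common in practice but is only one possible convention. The function names are illustrative and do not come from any particular tool. Real revisit records also carry a WARC-Profile field, which is omitted here.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a labelled digest such as 'sha1:<BASE32>' (an assumed convention)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def choose_record_type(payload: bytes, record_id: str, seen: dict) -> tuple:
    """Pick 'response' for new payloads and 'revisit' for ones already archived.

    `seen` is a hypothetical index: digest -> record ID of the first capture.
    Returns (record_type, extra_header_fields).
    """
    digest = payload_digest(payload)
    fields = {"WARC-Payload-Digest": digest}
    if digest in seen:
        # Point back at the earlier record instead of storing the payload again.
        fields["WARC-Refers-To"] = seen[digest]
        return "revisit", fields
    seen[digest] = record_id
    return "response", fields


seen_index = {}
first = choose_record_type(b"<html>same page</html>", "<urn:uuid:0001>", seen_index)
second = choose_record_type(b"<html>same page</html>", "<urn:uuid:0002>", seen_index)
print(first[0], second[0], second[1]["WARC-Refers-To"])
```

A production crawler would persist the index across crawls and would usually compare target URIs as well as digests before writing a revisit record.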

Serializations & Technical Formats

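WARC files are binary container files with a .warc extension (or .warc.gz when compressed with gzip). Each record begins with a version line (for example, WARC/1.1), followed by a plain-text header block and then a content block. The header uses a format similar to HTTP headers, with named fields like WARC-Type, WARC-Date, WARC-Record-ID, and Content-Length. A blank line (CRLF) ends the header block, Content-Length gives the size of the content block in bytes, and each record ends with two CRLF sequences. Because the content block may itself contain blank lines, readers locate record boundaries by Content-Length rather than by scanning for a separator. WARC files can grow to arbitrary sizes, though common practice limits them to about 1 GB for practical handling.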
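The record layout can be sketched with the Python standard library alone. The example below is a minimal illustration, not a conforming implementation: `build_record` and `parse_records` are hypothetical helper names, only uncompressed data is handled, and folded (multi-line) header values are not supported. It relies on Content-Length to find the end of each content block, so content that contains blank lines is read back intact.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(warc_type: str, content: bytes, extra_fields: dict = None) -> bytes:
    """Serialize one record: version line, header fields, blank line, content, two CRLFs."""
    fields = {
        "WARC-Type": warc_type,
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "Content-Length": str(len(content)),
    }
    fields.update(extra_fields or {})
    lines = [b"WARC/1.1"] + [f"{k}: {v}".encode("utf-8") for k, v in fields.items()]
    return CRLF.join(lines) + CRLF + CRLF + content + CRLF + CRLF


def parse_records(data: bytes):
    """Yield (version, headers, content) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        header_end = data.index(CRLF + CRLF, pos)  # blank line ends the header block
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")  # split on the first colon only
            headers[name.strip()] = value.strip()
        start = header_end + 4
        length = int(headers["Content-Length"])
        yield lines[0], headers, data[start:start + length]
        pos = start + length + 4  # skip the two CRLFs that close the record


blob = build_record(
    "resource",
    b"line one\r\n\r\nline two",
    {"WARC-Target-URI": "file:///tmp/example.txt", "Content-Type": "text/plain"},
) + build_record("metadata", b"note: demo")

for version, headers, content in parse_records(blob):
    print(version, headers["WARC-Type"], len(content))
```

For real archives, an established library such as warcio is a safer choice, since it also handles gzip-compressed files and the HTTP payload inside response records.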

Governance & Maintenance

The WARC specification is maintained by the International Internet Preservation Consortium (IIPC) and published through ISO. The IIPC hosts the specification development on GitHub, where the community can propose changes and discuss issues. ISO 28500:2009 defined WARC 1.0; ISO 28500:2017 updated it to version 1.1 with clarifications and minor extensions. The IIPC's Technical Committee oversees ongoing maintenance and coordinates with ISO for formal revisions.

Notable Implementations

The Internet Archive uses WARC as its primary storage format, holding petabytes of archived web content accessible through the Wayback Machine. National libraries operating web archiving programs, including the British Library, Library of Congress, Bibliothèque nationale de France, and the National Library of Australia, store their crawl data in WARC format. The Heritrix web crawler (developed by the Internet Archive), wget (with its --warc-file option), and Browsertrix Crawler all produce WARC output. Tools like pywb and OpenWayback replay archived content from WARC files, and warcio provides programmatic access in Python.

Related Standards

No directly related standards are currently indexed.
