The Web ARChive (WARC) format is a container file format designed for storing the results of web crawls: HTTP requests, responses, metadata, and related resources bundled into a single archival file. Standardized as ISO 28500, WARC is the backbone of large-scale web archiving operations at institutions like the Internet Archive, national libraries, and research organizations worldwide.
Background
Web archiving emerged as a critical preservation activity in the mid-1990s as it became clear that web content was ephemeral and frequently disappeared without trace. The Internet Archive, founded by Brewster Kahle in 1996, developed the ARC (ARChive) format for its Wayback Machine crawls. While functional, ARC had limitations: it could only store HTTP responses (not requests), lacked support for metadata records, and had no provision for recording revisit information or duplicates. The International Internet Preservation Consortium (IIPC), formed in 2003, led the effort to create a successor format. WARC was published as ISO 28500:2009 and revised as ISO 28500:2017.
Purpose & Scope
WARC files encapsulate complete web interactions for long-term preservation. A single WARC file can contain multiple record types: full HTTP request-response pairs, metadata annotations, conversion records (for format migrations), and continuation records for resources split across files. This comprehensive approach ensures that archived content can be replayed faithfully, preserving not just the page content but the full context of how it was retrieved.
Key Record Types
| Record Type | Description |
|---|---|
| warcinfo | Describes the WARC file itself (creator, date, parameters) |
| response | Complete HTTP response including headers and payload |
| request | Complete HTTP request including headers |
| metadata | Metadata about another record in the WARC file |
| revisit | Indicates content identical to a previously archived record |
| resource | A resource stored without full protocol response information (e.g., a file read from local storage) |
| conversion | A transformed version of another record |
| continuation | Content to be appended to an earlier record whose block was segmented (e.g., across WARC files) |
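As an illustration of how the revisit type relates to response, the following Python sketch decides which record type to write based on a payload digest. It is a minimal sketch under stated assumptions, not how any particular crawler works: the function names and the in-memory `seen` index are hypothetical, and the Base32-encoded SHA-1 label is a common convention rather than a requirement. The returned digest and record ID correspond to what a writer would typically place in the WARC-Payload-Digest and WARC-Refers-To fields.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a labelled digest of the form 'sha1:<Base32 SHA-1>'."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def choose_record_type(payload: bytes, seen: dict, record_id: str):
    """Pick 'response' for new payloads and 'revisit' for ones already stored.

    `seen` is a hypothetical index mapping digest -> record ID of the first
    capture. Returns (record_type, digest, refers_to); refers_to is None
    when the payload has not been seen before.
    """
    digest = payload_digest(payload)
    if digest in seen:
        # Identical payload already archived: point back at the first capture.
        return "revisit", digest, seen[digest]
    seen[digest] = record_id
    return "response", digest, None


if __name__ == "__main__":
    index = {}
    print(choose_record_type(b"<html>same</html>", index, "<urn:uuid:first>"))
    print(choose_record_type(b"<html>same</html>", index, "<urn:uuid:second>"))
```

A production archive would keep the digest index in persistent storage and would usually scope the lookup by URL as well as digest; both are left out here for brevity.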
Serializations & Technical Formats
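WARC files are binary container files with a .warc extension (or .warc.gz when compressed with gzip, conventionally one gzip member per record so that individual records can be read by seeking to their offset). Each record begins with a version line (for example, WARC/1.0), followed by a plain-text header block and then a content block. The header uses a format similar to HTTP headers, with named fields like WARC-Type, WARC-Date, WARC-Record-ID, and Content-Length. A blank line separates the header from the content block, Content-Length gives the size of the content block in bytes, and each record ends with two CRLF sequences. WARC files can grow to arbitrary sizes, though common practice limits them to about 1 GB for practical handling.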
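The record layout described above can be sketched with the Python standard library alone. This is a simplified illustration, not a complete implementation of ISO 28500: `build_record` and `read_records` are hypothetical helper names, the reader handles only uncompressed input, and it ignores folded (multi-line) header values. The key point it shows is that a reader finds the end of a content block from Content-Length, not by scanning for blank lines, so a payload may itself contain CRLF sequences.

```python
import uuid
from datetime import datetime, timezone
from io import BytesIO

CRLF = b"\r\n"


def build_record(rec_type: str, target_uri: str, payload: bytes) -> bytes:
    """Serialize one record: version line, named fields, blank line, block."""
    fields = [
        ("WARC-Type", rec_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(payload))),
    ]
    header = b"WARC/1.0" + CRLF
    for name, value in fields:
        header += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs end the record.
    return header + CRLF + payload + CRLF + CRLF


def read_records(stream):
    """Yield (fields, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line.strip() == b"":
            continue  # skip the CRLFs that trail the previous record
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a version line, got {line!r}")
        fields = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b"\n", b""):
                break  # blank line: header is complete
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip()] = value.strip()
        # The block length comes from the header, never from delimiters.
        block = stream.read(int(fields["Content-Length"]))
        yield fields, block


if __name__ == "__main__":
    data = build_record("resource", "urn:example:a", b"hello")
    data += build_record("resource", "urn:example:b", b"a\r\n\r\nb")
    for fields, block in read_records(BytesIO(data)):
        print(fields["WARC-Type"], fields["WARC-Target-URI"], len(block))
```

For real archives, an established library such as warcio is a better choice than a hand-written parser, since it also handles gzip members, HTTP payload parsing, and digests.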
Governance & Maintenance
The WARC specification is maintained by the International Internet Preservation Consortium (IIPC) and published through ISO. The IIPC hosts the specification development on GitHub, where the community can propose changes and discuss issues. ISO 28500:2009 defined WARC 1.0; ISO 28500:2017 updated it to version 1.1 with clarifications and minor extensions. The IIPC's Technical Committee oversees ongoing maintenance and coordinates with ISO for formal revisions.
Notable Implementations
The Internet Archive uses WARC as its primary storage format, holding petabytes of archived web content accessible through the Wayback Machine. National libraries operating web archiving programs, including the British Library, the Library of Congress, the Bibliothèque nationale de France, and the National Library of Australia, store their crawl data in WARC format. The Heritrix web crawler (developed by the Internet Archive), wget (with its --warc-file option), and Browsertrix Crawler all produce WARC output. Tools like pywb and OpenWayback replay archived content from WARC files, and warcio provides programmatic access in Python.