
Web ARChive Format

WARC

A file format for aggregating multiple digital resources into a single archive file together with related metadata. Standardized as ISO 28500, WARC extends the earlier ARC format developed by the Internet Archive and is maintained by the International Internet Preservation Consortium (IIPC). WARC files encapsulate complete HTTP transactions — requests, responses, headers, and content — enabling faithful preservation and replay of archived web content.

Overview

The Web ARChive (WARC) format is a container file format designed for storing the results of web crawls: HTTP requests, responses, metadata, and related resources bundled into a single archival file. Standardized as ISO 28500, WARC is the backbone of large-scale web archiving operations at institutions like the Internet Archive, national libraries, and research organizations worldwide.

Background

Web archiving emerged as a critical preservation activity in the mid-1990s as it became clear that web content was ephemeral and frequently disappeared without trace. The Internet Archive, founded by Brewster Kahle in 1996, developed the ARC (ARChive) format for its Wayback Machine crawls. While functional, ARC had limitations: it could only store HTTP responses (not requests), lacked support for metadata records, and had no provision for recording revisit information or duplicates. The International Internet Preservation Consortium (IIPC), formed in 2003, led the effort to create a successor format. WARC was published as ISO 28500:2009 and revised as ISO 28500:2017.

Purpose & Scope

WARC files encapsulate complete web interactions for long-term preservation. A single WARC file can contain multiple record types: full HTTP request-response pairs, metadata annotations, conversion records (for format migrations), and continuation records for resources split across files. This comprehensive approach ensures that archived content can be replayed faithfully, preserving not just the page content but the full context of how it was retrieved.

Key Record Types

Record Type     Description
warcinfo        Describes the WARC file itself (creator, date, parameters)
response        Complete HTTP response including headers and payload
request         Complete HTTP request including headers
metadata        Metadata about another record in the WARC file
revisit         Indicates content identical to a previously archived record
resource        A resource stored without protocol response information (e.g., a locally retrieved or generated file)
conversion      A transformed version of another record
continuation    Continuation of a record that was too large for one segment
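
As an illustration of how the revisit type relates to response, the sketch below picks a record type for each capture by comparing payload digests. It is a minimal sketch under stated assumptions, not how any particular crawler works: the class name RecordTypeChooser and the function payload_digest are hypothetical, and the SHA-1 digest with Base32 encoding is assumed here because it is a common convention. WARC-Payload-Digest and WARC-Refers-To are named fields from the WARC specification.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a labelled digest of the form 'sha1:<base32>' (assumed convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


class RecordTypeChooser:
    """Hypothetical helper: remembers payload digests and chooses a record type."""

    def __init__(self):
        # Maps payload digest -> record ID of the first capture with that payload.
        self._seen = {}

    def choose(self, record_id: str, payload: bytes) -> dict:
        """Return illustrative header fields for a new capture."""
        digest = payload_digest(payload)
        headers = {"WARC-Record-ID": record_id, "WARC-Payload-Digest": digest}
        original = self._seen.get(digest)
        if original is None:
            # First time this payload is seen: store it as a full response.
            self._seen[digest] = record_id
            headers["WARC-Type"] = "response"
        else:
            # Identical payload already archived: point back to the first capture.
            headers["WARC-Type"] = "revisit"
            headers["WARC-Refers-To"] = original
        return headers


if __name__ == "__main__":
    chooser = RecordTypeChooser()
    for rid, body in [("<urn:uuid:1>", b"same"), ("<urn:uuid:2>", b"same")]:
        print(chooser.choose(rid, body)["WARC-Type"])
```

A real writer would also emit the remaining mandatory header fields and store the payload for response records; this sketch only shows the type decision.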

Serializations & Technical Formats

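A WARC file is a concatenation of records, conventionally stored with a .warc extension (or .warc.gz when compressed with gzip, typically with each record compressed as its own gzip member so that individual records remain seekable). Each record begins with a version line (WARC/1.0 or WARC/1.1), followed by a plain-text header block and then a content block. The header uses a format similar to HTTP headers, with named fields like WARC-Type, WARC-Date, WARC-Record-ID, and Content-Length. A blank line (CRLF) ends the header block, Content-Length gives the size of the content block in bytes, and two CRLF sequences follow each record's content block. WARC files can grow to arbitrary sizes, though common practice limits them to about 1 GB for practical handling.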
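
The record layout described above can be sketched with the Python standard library alone. This is a minimal, non-validating illustration: the function names build_record and parse_records are hypothetical, the example URI is made up, only uncompressed data is handled, and folded header lines and the per-type mandatory fields are ignored.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(rec_type, content, extra=None):
    """Serialize one record: version line, header fields, blank line, content, two CRLFs."""
    headers = {
        "WARC-Type": rec_type,
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        **(extra or {}),
        "Content-Length": str(len(content)),
    }
    lines = [b"WARC/1.1"] + [f"{k}: {v}".encode("utf-8") for k, v in headers.items()]
    return CRLF.join(lines) + CRLF + CRLF + content + CRLF + CRLF


def parse_records(data):
    """Yield (headers, content) pairs from concatenated, uncompressed records."""
    pos = 0
    while pos < len(data):
        # The header block ends at the first blank line after the record start.
        header_end = data.index(CRLF + CRLF, pos)
        header_lines = data[pos:header_end].decode("utf-8").split("\r\n")
        if not header_lines[0].startswith("WARC/"):
            raise ValueError(f"expected a version line at offset {pos}")
        headers = {}
        for line in header_lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length, not a delimiter search, determines where the content ends.
        start = header_end + len(CRLF) * 2
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        # Skip the two CRLFs that follow the content block.
        pos = start + length + len(CRLF) * 2


if __name__ == "__main__":
    blob = build_record("warcinfo", b"software: example\r\n") + build_record(
        "resource", b"hello", {"WARC-Target-URI": "file:///hello.txt"}
    )
    for headers, content in parse_records(blob):
        print(headers["WARC-Type"], headers["Content-Length"], content)
```

Reading the content by Content-Length rather than searching for the record separator matters because the content block may itself contain blank lines, as HTTP responses do.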

Governance & Maintenance

The WARC specification is maintained by the International Internet Preservation Consortium (IIPC) and published through ISO. The IIPC hosts the specification development on GitHub, where the community can propose changes and discuss issues. ISO 28500:2009 defined WARC 1.0; ISO 28500:2017 updated it to version 1.1 with clarifications and minor extensions. The IIPC's Technical Committee oversees ongoing maintenance and coordinates with ISO for formal revisions.

Notable Implementations

The Internet Archive uses WARC as its primary storage format, holding petabytes of archived web content accessible through the Wayback Machine. National libraries operating web archiving programs, including the British Library, the Library of Congress, the Bibliothèque nationale de France, and the National Library of Australia, store their crawl data in WARC format. The Heritrix web crawler (developed by the Internet Archive), wget (with its --warc-file option), and Browsertrix Crawler all produce WARC output. Tools like pywb and OpenWayback replay archived content from WARC files, and warcio provides programmatic access in Python.
