Provenance, Authoring and Versioning Ontology

PAV

A lightweight ontology for tracking provenance, authoring, and versioning of web resources and scientific data. PAV distinguishes between content contributors, metadata creators, and curators, providing fine-grained properties such as pav:authoredBy, pav:curatedBy, pav:createdWith, and pav:version. Originally developed for use in biomedical and life sciences data, PAV is designed to complement W3C PROV-O by adding practical authoring and versioning information that PROV-O does not cover directly.

Overview

The Provenance, Authoring and Versioning (PAV) ontology provides a practical set of RDF properties for recording who created, contributed to, curated, and versioned a resource. Originally developed for biomedical data management, PAV addresses a gap left by more general provenance standards by distinguishing between different types of contribution and providing straightforward versioning properties that are immediately useful in real-world data pipelines.

Background

Provenance tracking is critical in scientific data, where knowing who produced, reviewed, and modified a dataset determines its trustworthiness. The W3C PROV-O ontology provides a comprehensive provenance model, but its generality makes it verbose for common authoring and versioning patterns. PAV was developed to complement PROV-O with lightweight, directly usable properties for the most common provenance assertions. The ontology originated in work on the AlzSWAN project for Alzheimer's disease research and was further refined within the Open PHACTS drug discovery platform. A formal description was published in the Journal of Biomedical Semantics in 2013.

Purpose & Scope

PAV distinguishes between several types of contribution that are often conflated in simpler metadata:

Property             Meaning
pav:authoredBy       The agent who created the original intellectual content
pav:createdBy        The agent who created this particular digital representation
pav:curatedBy        The agent who reviewed and validated the content
pav:contributedBy    An agent who contributed to the content without primary authorship
pav:createdWith      The software tool used to create the resource
pav:importedFrom     The source from which the resource was imported
pav:retrievedFrom    The URL from which the resource was downloaded
pav:version          A version identifier string
pav:previousVersion  Links to the preceding version of the resource

This granularity matters in scientific contexts. A dataset might be authored by a researcher, created by a conversion script, curated by a data steward, and imported from a public database -- and PAV can capture all of these distinctions.
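The dataset scenario above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it uses only the standard library and hand-writes Turtle rather than relying on an RDF toolkit. The PAV namespace http://purl.org/pav/ is the real one; every example.org identifier, the version string, and the helper names (iri, literal, to_turtle, version_history) are assumptions invented for the example.

```python
PAV = "http://purl.org/pav/"

# Hypothetical identifiers, invented for illustration only.
DATASET_V1 = "http://example.org/dataset/compounds/v1"
DATASET_V2 = "http://example.org/dataset/compounds/v2"


def iri(value):
    """Wrap an IRI in Turtle angle brackets."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Each entry: (subject IRI, PAV property local name, object in Turtle syntax).
triples = [
    (DATASET_V2, "authoredBy", iri("http://example.org/people/researcher")),
    (DATASET_V2, "createdBy", iri("http://example.org/agents/conversion-script")),
    (DATASET_V2, "curatedBy", iri("http://example.org/people/data-steward")),
    (DATASET_V2, "importedFrom", iri("http://example.org/public-database/record")),
    (DATASET_V2, "version", literal("2.0")),
    (DATASET_V2, "previousVersion", iri(DATASET_V1)),
    (DATASET_V1, "version", literal("1.0")),
]


def to_turtle(triples):
    """Serialize the triples as Turtle, grouping statements by subject."""
    lines = [f"@prefix pav: <{PAV}> .", ""]
    by_subject = {}
    for subject, prop, obj in triples:
        by_subject.setdefault(subject, []).append(f"pav:{prop} {obj}")
    for subject, pairs in by_subject.items():
        lines.append(iri(subject))
        lines.append("    " + " ;\n    ".join(pairs) + " .")
    return "\n".join(lines)


def version_history(triples, start):
    """Follow pav:previousVersion links from start back to the oldest version."""
    previous = {s: o.strip("<>") for s, p, o in triples if p == "previousVersion"}
    chain, seen = [start], {start}
    while chain[-1] in previous and previous[chain[-1]] not in seen:
        chain.append(previous[chain[-1]])
        seen.add(chain[-1])
    return chain


turtle_text = to_turtle(triples)
print(turtle_text)
print(version_history(triples, DATASET_V2))
```

Keeping each role in its own property means a consumer can ask "who curated this?" separately from "who authored this?", and the pav:previousVersion link lets a pipeline walk back through earlier releases without parsing version strings.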

Governance & Maintenance

PAV is maintained by its developer community, with the ontology source hosted on GitHub. Version 2.3.1, published in 2014, is the current release. PAV is designed to be used alongside W3C PROV-O; it imports and specializes several PROV-O properties, ensuring compatibility with the broader provenance ecosystem.
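One practical consequence of specializing PROV-O is that a PAV statement can be read as a more general PROV-O statement by following subproperty links. The sketch below illustrates the idea with a small hand-written mapping. The mapping is an assumption based on our reading of the PAV documentation, not a copy of the ontology's axioms, and should be checked against the published ontology file before use. The function names and the ex: identifiers are hypothetical.

```python
# Assumed subproperty links (child -> parent). Illustrative only: verify each
# entry against the PAV ontology file before relying on it.
SUBPROPERTY_OF = {
    "pav:authoredBy": "pav:contributedBy",
    "pav:curatedBy": "pav:contributedBy",
    "pav:contributedBy": "prov:wasAttributedTo",
    "pav:createdBy": "prov:wasAttributedTo",
    "pav:previousVersion": "prov:wasRevisionOf",
}


def superproperties(prop):
    """Return the chain of more general properties above prop, nearest first."""
    chain = []
    while prop in SUBPROPERTY_OF:
        prop = SUBPROPERTY_OF[prop]
        chain.append(prop)
    return chain


def infer(triples):
    """Add the more general statement implied by each subproperty link."""
    inferred = set(triples)
    for subject, prop, obj in triples:
        for parent in superproperties(prop):
            inferred.add((subject, parent, obj))
    return inferred


# Hypothetical data using prefixed names for brevity.
stated = {
    ("ex:dataset", "pav:authoredBy", "ex:researcher"),
    ("ex:dataset", "pav:previousVersion", "ex:dataset-v1"),
}

for triple in sorted(infer(stated)):
    print(triple)
```

A tool that only understands PROV-O could then still see that ex:dataset is attributed to ex:researcher, while PAV-aware tools keep the finer distinction between authoring, curating, and creating.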

Notable Implementations

PAV is used mainly in biomedical and life sciences linked data. The Open PHACTS drug discovery platform uses PAV to track data provenance across its integrated pharmaceutical datasets, and the nanopublications community uses PAV properties for attribution in nanopublication provenance graphs. PAV is also used in the Linked Data for Production (LD4P) project in academic libraries and in various biodiversity data platforms.

Related Standards

  • PROV-O -- The W3C provenance ontology that PAV complements and specializes
  • Dublin Core Terms -- General-purpose metadata terms that PAV extends with finer-grained authoring distinctions

Further Reading