The Provenance, Authoring and Versioning (PAV) ontology provides a practical set of RDF properties for recording who created, contributed to, curated, and versioned a resource. Originally developed for biomedical data management, PAV addresses a gap left by more general provenance standards by distinguishing between different types of contribution and providing straightforward versioning properties that are immediately useful in real-world data pipelines.
Background
Provenance tracking is critical in scientific data, where knowing who produced, reviewed, and modified a dataset determines its trustworthiness. The W3C PROV-O ontology provides a comprehensive provenance model, but its generality makes it verbose for common authoring and versioning patterns. PAV was developed to complement PROV-O with lightweight, directly usable properties for the most common provenance assertions. The ontology originated in work on the AlzSWAN project for Alzheimer's disease research and was further refined within the Open PHACTS drug discovery platform. A formal description was published in the Journal of Biomedical Semantics in 2013.
Purpose & Scope
PAV distinguishes between several types of contribution that are often conflated in simpler metadata:
| Property | Meaning |
|---|---|
| pav:authoredBy | The agent who created the original intellectual content |
| pav:createdBy | The agent who created this particular digital representation |
| pav:curatedBy | The agent who reviewed and validated the content |
| pav:contributedBy | An agent who contributed to the content without primary authorship |
| pav:createdWith | The software tool used to create the resource |
| pav:importedFrom | The source from which the resource was imported |
| pav:retrievedFrom | The URL from which the resource was downloaded |
| pav:version | A version identifier string |
| pav:previousVersion | Links to the preceding version of the resource |
This granularity matters in scientific contexts. A dataset might be authored by a researcher, created by a conversion script, curated by a data steward, and imported from a public database, and PAV records each of these roles with its own property.
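The dataset scenario above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it uses only the standard library, represents triples as plain tuples, and emits N-Triples. The `http://purl.org/pav/` namespace is PAV's own; every `http://example.org/` identifier, along with the `Literal`, `to_ntriples`, and `version_history` helpers, is a hypothetical name introduced for the example.

```python
# Namespace IRI of the PAV ontology.
PAV = "http://purl.org/pav/"
# Hypothetical base IRI; every identifier under it is made up for illustration.
EX = "http://example.org/"


class Literal(str):
    """Marks an object value as a string literal rather than an IRI."""


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, Literal):
            # Escape backslashes and double quotes inside the literal.
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def version_history(triples, resource):
    """Follow pav:previousVersion links from `resource` back to the oldest version."""
    previous = {s: o for s, p, o in triples if p == PAV + "previousVersion"}
    chain = [resource]
    while chain[-1] in previous:
        older = previous[chain[-1]]
        if older in chain:  # guard against accidental cycles in the data
            break
        chain.append(older)
    return chain


v1 = EX + "dataset/v1"
v2 = EX + "dataset/v2"

triples = [
    # Who did what: one PAV property per kind of contribution.
    (v2, PAV + "authoredBy", EX + "people/researcher"),
    (v2, PAV + "createdBy", EX + "agents/conversion-script"),
    (v2, PAV + "curatedBy", EX + "people/data-steward"),
    (v2, PAV + "importedFrom", EX + "public-database/record-42"),
    # Versioning: a version string plus a link to the preceding version.
    (v2, PAV + "version", Literal("2")),
    (v2, PAV + "previousVersion", v1),
    (v1, PAV + "version", Literal("1")),
]

print(to_ntriples(triples))
print(version_history(triples, v2))
```

In a real pipeline an RDF library would normally replace the hand-rolled serializer; the tuples are used here only to keep the sketch dependency-free and to make the one-property-per-role pattern visible.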
Governance & Maintenance
PAV is maintained by its developer community, with the ontology source hosted on GitHub. Version 2.3.1, published in 2014, is the current release. PAV is designed to be used alongside W3C PROV-O; it imports and specializes several PROV-O properties, ensuring compatibility with the broader provenance ecosystem.
Notable Implementations
PAV is widely adopted in biomedical and life sciences linked data. The Open PHACTS drug discovery platform uses PAV extensively for tracking data provenance across integrated pharmaceutical datasets. The nanopublications community uses PAV properties for attribution in nanopublication provenance graphs. It is also used in the Linked Data for Production (LD4P) project in academic libraries and in various biodiversity data platforms.
Related Standards
- PROV-O -- The W3C provenance ontology that PAV complements and specializes
- Dublin Core Terms -- General-purpose metadata terms that PAV extends with finer-grained authoring distinctions