Name: OpenLineage
Creator: The Linux Foundation
License: Apache-2.0
Keywords: scientific-data, web

Overview

OpenLineage is an open framework for collecting and analyzing data lineage metadata, providing a standardized API that data pipeline components can use to report information about datasets, jobs, and runs. Hosted under the Linux Foundation, it aims to create a consistent, vendor-neutral approach to understanding how data flows through complex data infrastructure.

Background

Data lineage — tracking where data comes from, how it transforms, and where it goes — has become essential as organizations build increasingly complex data pipelines. Before OpenLineage, lineage information was typically locked within individual tools, making cross-platform lineage tracking difficult or impossible. OpenLineage was created in 2021 to address this fragmentation by defining an open standard that any pipeline component can implement, regardless of vendor or technology stack.

Purpose & Scope

OpenLineage defines a standard API for capturing lineage events that occur during data processing. When a job runs, it emits events describing which datasets it reads from and writes to, along with metadata about the run itself (start time, end time, status, errors). These events flow to a compatible backend for storage and analysis. The standard covers three core entities: datasets (collections of data), jobs (processes that transform data), and runs (individual executions of jobs).
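As an illustration of the run lifecycle described above, the following is a minimal sketch that builds a START and a COMPLETE event as plain Python dictionaries. The field names (eventType, eventTime, run.runId, job, inputs, outputs, producer) follow a reading of the OpenLineage specification and should be checked against the published version. The namespace, job name, dataset names, and producer URI are hypothetical, and a real payload may need further fields (such as a schema URL) that are omitted here.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical identifier for the code emitting the events (an assumption).
PRODUCER = "https://example.com/my-pipeline"


def make_run_event(event_type, run_id, job_name, inputs=(), outputs=(),
                   namespace="example-namespace"):
    """Build one lineage event as a JSON-serializable dict."""
    def dataset(name):
        # A dataset is identified by its namespace and name.
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,                      # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},                     # same ID for every event of one run
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        "producer": PRODUCER,
    }


# One run of a hypothetical job: a START event, then a COMPLETE event.
run_id = str(uuid.uuid4())
start = make_run_event("START", run_id, "daily_orders_rollup",
                       inputs=["raw.orders"])
complete = make_run_event("COMPLETE", run_id, "daily_orders_rollup",
                          inputs=["raw.orders"],
                          outputs=["analytics.daily_orders"])

print(json.dumps(complete, indent=2))
```

Both events share one run ID, which is what lets a backend tie the start and the end of the same execution together.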

Core Concepts

Concept	Description
Dataset	A named collection of data, with a namespace and name
Job	A process that reads from and/or writes to datasets
Run	A single execution of a job, with start/complete/fail events
Facet	Extensible metadata attached to datasets, jobs, or runs
LineageEvent	An event emitted when a run changes state (START, COMPLETE, FAIL, etc.)

Serializations & Technical Formats

The OpenLineage API is specified with OpenAPI, and events are transmitted as JSON payloads. Extensibility is facet-based: standard and custom metadata are attached to the core entities as facets. Client libraries are available for Python and Java, and the Java API surface is documented in Javadoc.
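The facet model can be sketched as a small helper that attaches named metadata blocks to an entity before it is serialized to JSON. This is an illustrative sketch, not the client library's API: the helper name attach_facet, the custom facet name acme_dataOwner, and all URLs are hypothetical, and the _producer and _schemaURL keys and the shape of the schema facet reflect a reading of the specification that should be verified against it.

```python
import json

PRODUCER = "https://example.com/my-pipeline"          # hypothetical
SCHEMA_BASE = "https://example.com/facet-schemas"     # hypothetical placeholder


def attach_facet(entity, facet_name, payload):
    """Return a copy of `entity` with one more facet; the input is not mutated."""
    facets = dict(entity.get("facets", {}))
    facets[facet_name] = {
        "_producer": PRODUCER,                         # who emitted the facet
        "_schemaURL": f"{SCHEMA_BASE}/{facet_name}.json",
        **payload,
    }
    return {**entity, "facets": facets}


bare = {"namespace": "example-namespace", "name": "analytics.daily_orders"}

# A standard-style facet describing the dataset's columns.
dataset = attach_facet(bare, "schema", {
    "fields": [
        {"name": "order_date", "type": "DATE"},
        {"name": "total", "type": "DECIMAL"},
    ]
})

# A custom facet; the vendor-style prefix in the name is an assumed convention.
dataset = attach_facet(dataset, "acme_dataOwner", {"team": "analytics"})

print(json.dumps(dataset, indent=2, sort_keys=True))
```

Because every facet is a self-describing object under its own key, a consumer that does not recognize a custom facet can ignore it without failing on the rest of the payload.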

Governance & Maintenance

OpenLineage is governed under the Linux Foundation with a Technical Steering Committee (TSC) that holds monthly open meetings. The project follows an open-source development model with all work conducted on GitHub. Contributions are welcome from both individual developers and vendor organizations.

Notable Implementations

The reference implementation is Marquez, an open-source metadata repository that stores and serves OpenLineage events. OpenLineage integrations exist for Apache Airflow, Apache Spark, dbt, Apache Flink, and other widely used data processing frameworks. The project has attracted participation from data tool vendors and organizations building modern data platforms.
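To illustrate how an event might reach a backend such as Marquez, the sketch below builds an HTTP POST request with the Python standard library but does not send it. The base URL and the /api/v1/lineage path are assumptions about a local Marquez deployment and should be checked against its documentation; the event contents are hypothetical and trimmed for brevity.

```python
import json
import urllib.request


def build_lineage_request(event, base_url="http://localhost:5000"):
    """Wrap a lineage event in a JSON POST request (assumed endpoint path)."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/lineage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical, trimmed event used only to demonstrate the request shape.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00+00:00",
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
    "job": {"namespace": "example-namespace", "name": "daily_orders_rollup"},
    "producer": "https://example.com/my-pipeline",
}

request = build_lineage_request(event)
print(request.get_method(), request.full_url)

# Sending is left out so the sketch runs without a backend:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```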

Related Standards

DCAT — Data Catalog Vocabulary, which addresses dataset cataloging where OpenLineage addresses lineage tracking

Resources & Links

Specification

OpenLineage OpenAPI Specification

Documentation

Related Standards

Data Catalog Vocabulary (DCAT): ontology published by the World Wide Web Consortium
