OpenLineage is an open framework for collecting and analyzing data lineage metadata, providing a standardized API that data pipeline components can use to report information about datasets, jobs, and runs. Hosted under the Linux Foundation, it aims to create a consistent, vendor-neutral approach to understanding how data flows through complex data infrastructure.
Background
Data lineage — tracking where data comes from, how it transforms, and where it goes — has become essential as organizations build increasingly complex data pipelines. Before OpenLineage, lineage information was typically locked within individual tools, making cross-platform lineage tracking difficult or impossible. OpenLineage was created in 2021 to address this fragmentation by defining an open standard that any pipeline component can implement, regardless of vendor or technology stack.
Purpose & Scope
OpenLineage defines a standard API for capturing lineage events that occur during data processing. When a job runs, it emits events describing which datasets it reads from and writes to, along with metadata about the run itself (start time, end time, status, errors). These events flow to a compatible backend for storage and analysis. The standard covers three core entities: datasets (collections of data), jobs (processes that transform data), and runs (individual executions of jobs).
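The event flow described above can be sketched with plain Python dictionaries. This is a simplified illustration, not a schema-complete event: the field names follow one reading of the OpenLineage run event schema and should be checked against the published specification, and the job name, dataset names, producer URI, and the `make_run_event` helper are all hypothetical.

```python
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, run_id, job_namespace, job_name, inputs, outputs):
    """Build one lineage event as a plain dict (simplified, not schema-complete)."""
    return {
        "eventType": event_type,  # e.g. "START", "COMPLETE", "FAIL"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical URI identifying the code that emitted the event.
        "producer": "https://example.com/my-pipeline",
    }


# One run of a hypothetical job: the same run ID ties START and COMPLETE together.
run_id = str(uuid.uuid4())
inputs = [("warehouse", "raw.orders")]
outputs = [("warehouse", "analytics.daily_orders")]

start_event = make_run_event("START", run_id, "etl", "daily_orders", inputs, outputs)
# ... the job does its work here ...
complete_event = make_run_event("COMPLETE", run_id, "etl", "daily_orders", inputs, outputs)
```

Sharing one run ID across both events is what lets a backend reconstruct the run's start time, end time, and final status from separate messages.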
Core Concepts
| Concept | Description |
|---|---|
| Dataset | A named collection of data, with a namespace and name |
| Job | A process that reads from and/or writes to datasets |
| Run | A single execution of a job, with start/complete/fail events |
| Facet | Extensible metadata attached to datasets, jobs, or runs |
| LineageEvent | An event emitted when a run changes state (for example START, COMPLETE, or FAIL); the specification names this type `RunEvent` |
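One way to picture how the concepts in the table relate is as a small set of Python dataclasses. This is an illustrative model only, not the classes shipped in the official client libraries; the class layout, attribute names, and sample values are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Dataset:
    namespace: str
    name: str
    facets: Dict[str, Any] = field(default_factory=dict)  # extensible metadata


@dataclass
class Job:
    namespace: str
    name: str
    facets: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Run:
    run_id: str
    facets: Dict[str, Any] = field(default_factory=dict)


@dataclass
class LineageEvent:
    event_type: str  # "START", "COMPLETE", "FAIL", ...
    run: Run
    job: Job
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)


# A hypothetical job that reads one dataset and writes another.
orders = Dataset("warehouse", "raw.orders")
daily = Dataset("warehouse", "analytics.daily_orders")
event = LineageEvent(
    event_type="COMPLETE",
    run=Run(run_id="run-001"),  # placeholder ID for illustration
    job=Job("etl", "daily_orders"),
    inputs=[orders],
    outputs=[daily],
)
```

Each entity carries its own `facets` mapping, which mirrors the idea that facets attach to datasets, jobs, or runs rather than to the event as a whole.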
Serializations & Technical Formats
The OpenLineage API is specified with OpenAPI, and events are transmitted as JSON payloads. Extensibility is facet-based: standard and custom metadata can be attached to the core entities (datasets, jobs, and runs) as facets. Client libraries are available for Python and Java, and the Java client's API is documented in Javadoc.
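As a rough illustration of the facet model and JSON serialization, the sketch below attaches a custom facet to a dataset and round-trips it through JSON. The facet key `example_ownerTeam`, its `team` field, and both URLs are hypothetical; the `_producer` and `_schemaURL` fields reflect one reading of the facet schema and should be verified against the specification.

```python
import json

# Hypothetical custom facet; the key and fields are illustrative only.
custom_facet = {
    "_producer": "https://example.com/my-pipeline",
    "_schemaURL": "https://example.com/schemas/OwnerTeamFacet.json",
    "team": "analytics",
}

# A dataset with the custom facet attached under its own key.
dataset = {
    "namespace": "warehouse",
    "name": "analytics.daily_orders",
    "facets": {"example_ownerTeam": custom_facet},
}

# Serialize to the JSON text that would travel inside an event, then parse it back.
payload = json.dumps(dataset, indent=2, sort_keys=True)
decoded = json.loads(payload)
```

Because facets are just named JSON objects, a consumer that does not recognize `example_ownerTeam` can ignore it and still process the dataset's namespace and name.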
Governance & Maintenance
OpenLineage is governed under the Linux Foundation with a Technical Steering Committee (TSC) that holds monthly open meetings. The project follows an open-source development model with all work conducted on GitHub. Contributions are welcome from both individual developers and vendor organizations.
Notable Implementations
The reference implementation is Marquez, an open-source metadata repository that stores and serves OpenLineage events. OpenLineage integrations exist for Apache Airflow, Apache Spark, dbt, Apache Flink, and other widely used data processing frameworks. The project has attracted participation from data tool vendors and organizations building modern data platforms.
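Delivering an event to a backend usually amounts to an HTTP POST of the JSON payload. The sketch below only builds the request, using the Python standard library, and does not send it. The URL assumes a Marquez instance running locally with its lineage endpoint at `/api/v1/lineage` on port 5000; treat that as an assumption and adjust it to the actual deployment. The `build_lineage_request` helper and the sample event are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for a locally running backend; adjust to your deployment.
LINEAGE_URL = "http://localhost:5000/api/v1/lineage"


def build_lineage_request(event, url=LINEAGE_URL):
    """Wrap an event dict in an HTTP POST request with a JSON body."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Minimal, simplified event used only to demonstrate the request shape.
event = {
    "eventType": "COMPLETE",
    "run": {"runId": "run-001"},  # placeholder ID
    "job": {"namespace": "etl", "name": "daily_orders"},
}
request = build_lineage_request(event)
# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the backend.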
Related Standards
- DCAT (Data Catalog Vocabulary), a W3C vocabulary that addresses dataset cataloging, whereas OpenLineage addresses lineage tracking