OpenLineage

By LF

An open framework for data lineage collection and analysis, providing a standard API for capturing lineage events across data pipeline components. OpenLineage tracks metadata about datasets, jobs, and runs, enabling users to identify the root cause of data quality issues and understand the impact of changes. The project includes a standard API specification, a reference implementation (Marquez), client libraries for common languages, and integrations with data pipeline tools such as Apache Airflow and Apache Spark.

Overview

OpenLineage is an open framework for collecting and analyzing data lineage metadata, providing a standardized API that data pipeline components can use to report information about datasets, jobs, and runs. Hosted under the Linux Foundation, it aims to create a consistent, vendor-neutral approach to understanding how data flows through complex data infrastructure.

Background

Data lineage — tracking where data comes from, how it transforms, and where it goes — has become essential as organizations build increasingly complex data pipelines. Before OpenLineage, lineage information was typically locked within individual tools, making cross-platform lineage tracking difficult or impossible. OpenLineage was created in 2021 to address this fragmentation by defining an open standard that any pipeline component can implement, regardless of vendor or technology stack.

Purpose & Scope

OpenLineage defines a standard API for capturing lineage events that occur during data processing. When a job runs, it emits events describing which datasets it reads from and writes to, along with metadata about the run itself (start time, end time, status, errors). These events flow to a compatible backend for storage and analysis. The standard covers three core entities: datasets (collections of data), jobs (processes that transform data), and runs (individual executions of jobs).
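As a rough illustration of the flow described above, the following sketch builds a START and a COMPLETE event for one run as plain JSON-ready dictionaries. It uses only the Python standard library rather than the official client. The namespaces, job and dataset names, the producer URI, and the exact schema version in the schemaURL are assumptions chosen for the example.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical producer URI and an illustrative schema version (assumptions).
PRODUCER = "https://example.com/my-pipeline"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def make_run_event(event_type, run_id, job_namespace, job_name,
                   inputs=(), outputs=()):
    """Build one lineage event as a JSON-serializable dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n} for ns, n in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


# Both events share one run ID, which ties them to the same execution.
run_id = str(uuid.uuid4())
source = ("postgres://db.example.com:5432", "public.orders")          # hypothetical
target = ("postgres://db.example.com:5432", "public.order_summary")   # hypothetical

start_event = make_run_event(
    "START", run_id, "example_namespace", "summarize_orders", inputs=[source])
complete_event = make_run_event(
    "COMPLETE", run_id, "example_namespace", "summarize_orders",
    inputs=[source], outputs=[target])

print(json.dumps(start_event, indent=2))
```

A failed run would be reported the same way, with "FAIL" as the event type in place of "COMPLETE".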

Core Concepts

Concept        Description
-------        -----------
Dataset        A named collection of data, with a namespace and name
Job            A process that reads from and/or writes to datasets
Run            A single execution of a job, with start/complete/fail events
Facet          Extensible metadata attached to datasets, jobs, or runs
LineageEvent   An event emitted when a run changes state (START, COMPLETE, FAIL, etc.)
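One way to picture how these concepts relate is as a small set of record types. The sketch below is only a mental model, assuming Python dataclasses; the class and attribute names are illustrative and are not the field names or classes of the specification or of any client library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Dataset:
    namespace: str
    name: str
    facets: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Job:
    namespace: str
    name: str
    facets: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Run:
    run_id: str
    facets: Dict[str, Any] = field(default_factory=dict)


@dataclass
class LineageEvent:
    event_type: str  # e.g. "START", "COMPLETE", "FAIL"
    run: Run
    job: Job
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)


# Hypothetical names: one job, one run of it, two state changes.
orders = Dataset("warehouse", "public.orders")
summary = Dataset("warehouse", "public.order_summary")
job = Job("example_namespace", "summarize_orders")
run = Run("run-001")

events = [
    LineageEvent(state, run, job, inputs=[orders], outputs=[summary])
    for state in ("START", "COMPLETE")
]

for event in events:
    print(event.event_type, event.job.name, "->", event.outputs[0].name)
```

The point of the model is that a job is a long-lived definition, a run is one execution of it, and each event records a state change of that run together with the datasets involved.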

Serializations & Technical Formats

The OpenLineage API is specified using OpenAPI, and events are transmitted as JSON payloads. Extensibility is facet-based: standard and custom metadata are attached to the core entities as facets. Client libraries are available for Python and Java, and the Java API is documented in Javadoc.
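The facet model can be sketched as a nested "facets" object on an entity, where each facet is a named JSON object. The helper below is hypothetical and uses only the standard library. The `_producer` and `_schemaURL` keys follow the convention used by facets in the specification as far as we are aware; the URLs, the `acme_dataOwner` facet, and its fields are assumptions made up for this example.

```python
import copy
import json


def attach_facet(entity, facet_name, facet_body, producer, schema_url):
    """Return a copy of `entity` with one facet added under `facets`."""
    updated = copy.deepcopy(entity)
    facets = updated.setdefault("facets", {})
    facets[facet_name] = {
        "_producer": producer,
        "_schemaURL": schema_url,
        **facet_body,
    }
    return updated


PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer URI

dataset = {"namespace": "warehouse", "name": "public.orders"}

# A schema-style facet listing the dataset's columns (schema URL is a placeholder).
with_schema = attach_facet(
    dataset,
    "schema",
    {"fields": [{"name": "order_id", "type": "BIGINT"},
                {"name": "total", "type": "DECIMAL"}]},
    PRODUCER,
    "https://example.com/schemas/SchemaDatasetFacet.json",
)

# A custom facet; the vendor-style prefix in the name is an assumed convention.
with_owner = attach_facet(
    with_schema,
    "acme_dataOwner",
    {"team": "analytics"},
    PRODUCER,
    "https://example.com/schemas/DataOwnerFacet.json",
)

print(json.dumps(with_owner, indent=2))
```

Because facets are keyed by name, a consumer that does not recognize a custom facet can ignore it and still read the rest of the event.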

Governance & Maintenance

OpenLineage is governed under the Linux Foundation with a Technical Steering Committee (TSC) that holds monthly open meetings. The project follows an open-source development model with all work conducted on GitHub. Contributions are welcome from both individual developers and vendor organizations.

Notable Implementations

The reference implementation is Marquez, an open-source metadata repository that stores and serves OpenLineage events. OpenLineage integrations exist for Apache Airflow, Apache Spark, dbt, Apache Flink, and other widely used data processing frameworks. The project has attracted participation from data tool vendors and organizations building modern data platforms.
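To make the handoff to a backend concrete, the sketch below prepares an HTTP POST carrying one event, without sending it. It assumes a Marquez instance listening locally on port 5000 and accepting events at /api/v1/lineage; the base URL, the event contents, and the helper name are assumptions for illustration, and a real deployment would more likely use an OpenLineage client library or integration.

```python
import json
import urllib.request


def build_lineage_request(event, base_url="http://localhost:5000"):
    """Prepare (but do not send) a POST request carrying one event as JSON."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/lineage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical event with fixed values so the example is reproducible.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00+00:00",
    "run": {"runId": "3f5e83fa-3480-44ff-99c5-ff943904e5e8"},
    "job": {"namespace": "example_namespace", "name": "summarize_orders"},
    "inputs": [{"namespace": "warehouse", "name": "public.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "public.order_summary"}],
    "producer": "https://example.com/my-pipeline",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
}

request = build_lineage_request(event)
print(request.get_method(), request.full_url)

# Sending requires a running backend:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes the payload easy to inspect or test before any backend is available.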

Related Standards

  • DCAT — Data Catalog Vocabulary, which addresses dataset cataloging where OpenLineage addresses lineage tracking
