Name: Croissant Format Specification
Creator: MLCommons
License: Apache-2.0
Keywords: scientific-data, web

Croissant is a metadata format for machine learning datasets that bridges the gap between how datasets are stored and how ML tools need to consume them. Developed by MLCommons and published as version 1.0 in March 2024, it extends Schema.org's Dataset vocabulary with ML-specific constructs for describing file organization, data structure, and semantic types. Major dataset platforms -- Hugging Face, Kaggle, Google Dataset Search, and OpenML -- adopted Croissant at launch, making it a de facto interoperability layer for ML dataset metadata.

Background

The machine learning community has long struggled with dataset interoperability. Datasets hosted across different platforms used incompatible metadata formats, and researchers routinely spent more time on data preparation than on model development. MLCommons, the open engineering consortium behind the MLPerf benchmark suite, convened a working group in 2023 to develop a common metadata format. The group brought together researchers and engineers from Google, Meta, King's College London, OpenML, and other institutions. The resulting specification was authored by Omar Benjelloun, Elena Simperl, Pierre Marcenac, Pierre Ruyssen, and others.

Purpose and Scope

Croissant addresses four challenges in the ML dataset ecosystem:

Discoverability -- Croissant metadata enables dataset search engines to parse and index datasets regardless of where they are published.
Portability and Reproducibility -- The format provides enough information for ML tools to load a dataset with a few lines of code, and because the format is standardized, Croissant-compliant tools are expected to interpret the same description in the same way.
Responsible AI -- The specification includes a modular RAI (Responsible AI) extension vocabulary that captures dataset provenance, labeling processes, and safety-relevant metadata.
Extensibility -- Croissant is designed for community extensions addressing specific data modalities (audio, video) and domains (geospatial, life sciences, cultural heritage).

Key Components

The specification's core constructs can be grouped into three layers:

Layer	Description	Key Classes
Dataset-level	General metadata: name, description, license, creators, URL, citation	Schema.org Dataset properties
Resources	Physical file organization: individual files and file collections	`FileObject`, `FileSet`
RecordSets	Logical data structure: fields, types, joins, splits	`RecordSet`, `Field`, `DataSource`
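As a rough illustration of how the layers in the table nest inside one description, the sketch below writes a small example as a Python dictionary and walks it. The dataset name, file name, URLs, and column name are hypothetical placeholders. The property names (`distribution`, `recordSet`, `field`, `source`, `extract`) follow the Croissant 1.0 JSON-LD vocabulary as it is usually written and should be checked against the specification before being relied on. The helper `referenced_files` is an illustrative name, not part of any library.

```python
# Sketch of the three layers in one Croissant-style description.
# All names, URLs, and values are hypothetical placeholders.
dataset = {
    # Layer 1: dataset-level metadata (Schema.org Dataset properties)
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "A toy dataset used only for illustration.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_reviews",
    # Layer 2: resources (physical files)
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "name": "reviews.csv",
            "contentUrl": "https://example.org/toy_reviews/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Layer 3: record sets (logical structure over the files)
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "name": "text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                }
            ],
        }
    ],
}


def referenced_files(description):
    """Return the @id of every file that some field draws its data from."""
    ids = set()
    for record_set in description.get("recordSet", []):
        for field in record_set.get("field", []):
            file_ref = field.get("source", {}).get("fileObject", {})
            if "@id" in file_ref:
                ids.add(file_ref["@id"])
    return ids
```

The point of the helper is that the RecordSet layer refers back to the Resources layer by identifier, so a consumer can check that every field points at a declared file.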

Required Dataset Properties

The specification mandates @context, @type, dct:conformsTo, name, description, license, url, creator, and datePublished at the dataset level. Recommended properties include keywords, publisher, version, dateCreated, dateModified, and inLanguage.
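A minimal sketch of a presence check for the required properties listed above. It treats keys literally and does no JSON-LD expansion, so it assumes the metadata uses exactly the spellings given in this section (for example `dct:conformsTo` rather than an alias defined in the context). The function name `missing_required` is illustrative and not part of any Croissant tool.

```python
# Required dataset-level properties, as listed in the paragraph above.
REQUIRED = (
    "@context", "@type", "dct:conformsTo", "name", "description",
    "license", "url", "creator", "datePublished",
)


def missing_required(metadata):
    """Return the required keys that are absent or empty, in listed order."""
    return [key for key in REQUIRED if not metadata.get(key)]
```

A caller could run this before publishing a description and report the returned list to the dataset author. A full validator would also check value types and the recommended properties.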

Data Types

Croissant supports atomic types (Boolean, Date, Float, Integer, Text) and semantic types (ImageObject, BoundingBox, Split). Types from external vocabularies such as Wikidata can also be used, enabling domain-specific semantics.
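To make the atomic types concrete, the sketch below maps each one to a Python parser for raw string values, such as cells read from a CSV file. The mapping, the `sc:` prefixed spellings, and the rule for what counts as a true Boolean are assumptions made for illustration. The specification does not prescribe this conversion.

```python
from datetime import date

# Hypothetical mapping from Croissant atomic types to Python parsers.
PARSERS = {
    # Assumption: "true" and "1" are true; every other string is false.
    "sc:Boolean": lambda s: s.strip().lower() in ("true", "1"),
    "sc:Date": date.fromisoformat,  # expects YYYY-MM-DD
    "sc:Float": float,
    "sc:Integer": int,
    "sc:Text": str,
}


def parse_value(raw, data_type):
    """Convert a raw string with the parser registered for data_type."""
    try:
        parser = PARSERS[data_type]
    except KeyError:
        raise ValueError(f"unsupported dataType: {data_type}") from None
    return parser(raw)
```

Semantic types such as ImageObject or BoundingBox would need richer handling than a single string parser, which is why they are left out of the table.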

Versioning

The specification adopts semantic versioning (MAJOR.MINOR.PATCH) for datasets and provides guidance on live datasets with continuously updated data, including checksum management for evolving files.
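The MAJOR.MINOR.PATCH scheme can be checked and compared with a few lines of code. The sketch below is one possible approach, not something the specification provides: it accepts only the three-number core form and ignores pre-release or build suffixes. `parse_version` is an illustrative name.

```python
import re

# MAJOR.MINOR.PATCH: non-negative integers without leading zeros.
_VERSION = re.compile(r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)")


def parse_version(text):
    """Return (major, minor, patch) as integers, or raise ValueError."""
    match = _VERSION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(int(part) for part in match.groups())
```

Because the result is a tuple of integers, ordinary comparison gives the right ordering: `parse_version("1.10.0") > parse_version("1.9.3")` holds, whereas comparing the raw strings would not.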

Serializations and Technical Formats

Croissant metadata is encoded in JSON-LD and embedded in web pages following the Schema.org pattern. The vocabulary namespace is http://mlcommons.org/croissant/ (abbreviated cr), and the specification version URI is http://mlcommons.org/croissant/1.0. The format also relies on the Schema.org (sc), Dublin Core Terms (dct), and Wikidata (wd) namespaces.
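The sketch below shows one way the namespaces and the embedding pattern might fit together. The `cr` namespace and the version URI come from the paragraph above. The IRIs given for `sc`, `dct`, and `wd` are the ones commonly used for those vocabularies and are an assumption here. The context is heavily reduced compared with the one in real Croissant files, and `to_script_tag` is a hypothetical helper.

```python
import json

# Reduced JSON-LD context; real Croissant files declare many more terms.
CONTEXT = {
    "cr": "http://mlcommons.org/croissant/",
    "sc": "https://schema.org/",              # assumed IRI
    "dct": "http://purl.org/dc/terms/",       # assumed IRI
    "wd": "https://www.wikidata.org/wiki/",   # assumed IRI
}

SPEC_VERSION = "http://mlcommons.org/croissant/1.0"


def to_script_tag(metadata):
    """Serialize metadata as a JSON-LD <script> element for an HTML page."""
    payload = json.dumps(metadata, indent=2, sort_keys=True)
    # Escape "</" so the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

A page generator could call `to_script_tag({"@context": CONTEXT, "@type": "sc:Dataset", "dct:conformsTo": SPEC_VERSION, ...})` and place the result in the page head, which is the usual Schema.org embedding pattern.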

Governance and Maintenance

Croissant is maintained by MLCommons as an open standard under the Apache 2.0 license. The specification, tools, and example datasets are hosted on GitHub. A Python library (mlcroissant) provides a reference implementation for reading, validating, and loading Croissant-described datasets. A visual editor is available on Hugging Face Spaces for creating and editing Croissant metadata without writing JSON-LD by hand.

Notable Implementations

Hugging Face generates Croissant metadata for hosted datasets
Kaggle embeds Croissant metadata in dataset pages
Google Dataset Search indexes Croissant files for discovery
OpenML exports dataset metadata in Croissant format
The mlcroissant Python library loads Croissant-described datasets for use with ML frameworks such as PyTorch, TensorFlow, and JAX

Related Standards

Schema.org provides the base vocabulary that Croissant extends with ML-specific properties
Dublin Core Terms supplies the dct:conformsTo property used for spec version declaration
DCAT provides catalog-level vocabulary for dataset discovery that complements Croissant's structural descriptions
