Croissant is a metadata format for machine learning datasets that bridges the gap between how datasets are stored and how ML tools need to consume them. Developed by MLCommons and published as version 1.0 in March 2024, it extends Schema.org's Dataset vocabulary with ML-specific constructs for describing file organization, data structure, and semantic types. Major dataset platforms -- Hugging Face, Kaggle, Google Dataset Search, and OpenML -- adopted Croissant at launch, making it a de facto interoperability layer for ML dataset metadata.
Background
The machine learning community has long struggled with dataset interoperability. Datasets hosted across different platforms used incompatible metadata formats, and researchers routinely spent more time on data preparation than on model development. MLCommons, the open engineering consortium behind the MLPerf benchmark suite, convened a working group in 2023 to develop a common metadata format. The group brought together researchers and engineers from Google, Meta, King's College London, OpenML, and other institutions. The resulting specification was authored by Omar Benjelloun, Elena Simperl, Pierre Marcenac, Pierre Ruyssen, and others.
Purpose and Scope
Croissant addresses four challenges in the ML dataset ecosystem:
- Discoverability -- Croissant metadata enables dataset search engines to parse and index datasets regardless of where they are published.
- Portability and Reproducibility -- The format provides sufficient information for ML tools to load a dataset with just a few lines of code, and because the format is standardized, any Croissant-compliant tool interprets the data identically.
- Responsible AI -- The specification includes a modular RAI (Responsible AI) extension vocabulary that captures dataset provenance, labeling processes, and safety-relevant metadata.
- Extensibility -- Croissant is designed for community extensions addressing specific data modalities (audio, video) and domains (geospatial, life sciences, cultural heritage).
Key Components
The specification defines a three-layer architecture:
| Layer | Description | Key Classes |
|---|---|---|
| Dataset-level | General metadata: name, description, license, creators, URL, citation | Schema.org Dataset properties |
| Resources | Physical file organization: individual files and file collections | FileObject, FileSet |
| RecordSets | Logical data structure: fields, types, joins, splits | RecordSet, Field, DataSource |
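As a rough illustration of how the three layers fit together, the sketch below builds a small Croissant-style document as a Python dictionary and serializes it to JSON-LD. The dataset name, URLs, creator, file name, and column name are all hypothetical, and the @context is abbreviated to the terms used here; a real file should carry the full context given in the specification.

```python
import json

# Abbreviated context: only the prefixes and terms this sketch uses
# (an assumption for readability, not the full context from the spec).
context = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/",
    "recordSet": "cr:recordSet",
    "field": "cr:field",
    "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
    "source": "cr:source",
    "fileObject": "cr:fileObject",
    "extract": "cr:extract",
    "column": "cr:column",
}

metadata = {
    "@context": context,
    # Layer 1: dataset-level metadata (Schema.org Dataset properties).
    "@type": "sc:Dataset",
    "dct:conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "toy_reviews",  # hypothetical dataset
    "description": "A toy dataset of product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_reviews",
    "creator": {"@type": "sc:Person", "name": "Jane Doe"},
    "datePublished": "2024-03-06",
    # Layer 2: resources, i.e. the physical files.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "name": "reviews.csv",
            "contentUrl": "https://example.org/toy_reviews/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Layer 3: record sets, i.e. the logical view over those files.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "name": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "name": "text",
                    "dataType": "sc:Text",
                    # The field reads one CSV column from the resource above.
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                }
            ],
        }
    ],
}

# Every field source should point at a resource declared in `distribution`.
resource_ids = {resource["@id"] for resource in metadata["distribution"]}
referenced = {
    field["source"]["fileObject"]["@id"]
    for record_set in metadata["recordSet"]
    for field in record_set["field"]
}
assert referenced <= resource_ids

croissant_json = json.dumps(metadata, indent=2)
print(croissant_json)
```

The point of the split is that the RecordSet layer never names a URL or file format directly; it only refers to a resource by its @id, so the physical layout can change without touching the logical description.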
Required Dataset Properties
The specification mandates @context, @type, dct:conformsTo, name, description, license, url, creator, and datePublished at the dataset level. Recommended properties include keywords, publisher, version, dateCreated, dateModified, and inLanguage.
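A check for these properties can be sketched in a few lines of Python. The helper below is hypothetical and is not part of any Croissant tooling; it assumes the metadata has already been parsed into a dictionary whose keys are spelled exactly as in the list above. Files that alias a property through their @context (for example, writing conformsTo instead of dct:conformsTo) would need that mapping applied first.

```python
REQUIRED = (
    "@context", "@type", "dct:conformsTo", "name", "description",
    "license", "url", "creator", "datePublished",
)
RECOMMENDED = (
    "keywords", "publisher", "version",
    "dateCreated", "dateModified", "inLanguage",
)


def missing(metadata: dict, names: tuple) -> list:
    """Return the names from `names` that are absent or empty in `metadata`."""
    empty = (None, "", [], {})
    return [name for name in names if metadata.get(name) in empty]


# Hypothetical, deliberately incomplete metadata.
example = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "sc:Dataset",
    "dct:conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "toy_reviews",
    "description": "A toy dataset of product reviews.",
    "url": "https://example.org/toy_reviews",
    "keywords": ["reviews", "text"],
}

missing_required = missing(example, REQUIRED)
missing_recommended = missing(example, RECOMMENDED)
print(missing_required)  # ['license', 'creator', 'datePublished']
print(missing_recommended)
```

A missing required property would normally be treated as an error and a missing recommended one as a warning, although that policy is a choice of the calling tool rather than something this sketch enforces.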
Data Types
Croissant supports atomic types (Boolean, Date, Float, Integer, Text) and semantic types (ImageObject, BoundingBox, Split). Types from external vocabularies such as Wikidata can also be used, enabling domain-specific semantics.
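To make the role of atomic types concrete, the following sketch maps each one to a Python parser and converts raw text values accordingly. The mapping, the sc: prefix on the type names, and the helper names are assumptions for illustration only; this is not how any particular Croissant library implements type handling, and semantic types are left out.

```python
from datetime import date


def parse_bool(text: str) -> bool:
    """Parse a textual Boolean; the accepted spellings are an assumption."""
    lowered = text.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a Boolean: {text!r}")


# Hypothetical mapping from atomic dataType values to Python parsers.
ATOMIC_PARSERS = {
    "sc:Boolean": parse_bool,
    "sc:Date": date.fromisoformat,
    "sc:Float": float,
    "sc:Integer": int,
    "sc:Text": str,
}


def cast_value(data_type: str, raw: str):
    """Convert a raw string to the Python value implied by `data_type`."""
    try:
        parser = ATOMIC_PARSERS[data_type]
    except KeyError:
        raise ValueError(f"unsupported dataType: {data_type}") from None
    return parser(raw)


print(cast_value("sc:Integer", "42"))
print(cast_value("sc:Date", "2024-03-06"))
print(cast_value("sc:Boolean", "False"))
```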
Versioning
The specification adopts semantic versioning (MAJOR.MINOR.PATCH) for datasets and provides guidance on live datasets with continuously updated data, including checksum management for evolving files.
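The two mechanisms mentioned here, version strings and file checksums, can be sketched with the standard library alone. The helper names below are hypothetical, the sample file contents are made up, and the choice of SHA-256 assumes the checksum is stored as a SHA-256 digest; which version component to bump when a file changes is a policy decision that the sketch does not encode.

```python
import hashlib
import re

_SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_version(version: str) -> tuple:
    """Split MAJOR.MINOR.PATCH into a tuple of ints so versions compare numerically."""
    match = _SEMVER.fullmatch(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def sha256_hex(content: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def needs_checksum_update(recorded_sha256: str, content: bytes) -> bool:
    """True when the file no longer matches the checksum recorded in the metadata."""
    return sha256_hex(content) != recorded_sha256


# Numeric comparison avoids the string-ordering trap where "1.10.0" < "1.9.3".
print(parse_version("1.10.0") > parse_version("1.9.3"))

# A "live" file gains a row, so the recorded checksum goes stale.
old_content = b"id,text\n1,good\n"
recorded = sha256_hex(old_content)
new_content = old_content + b"2,bad\n"
print(needs_checksum_update(recorded, old_content))
print(needs_checksum_update(recorded, new_content))
```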
Serializations and Technical Formats
Croissant metadata is encoded in JSON-LD and embedded in web pages following the Schema.org pattern. The vocabulary namespace is http://mlcommons.org/croissant/ (abbreviated cr), and the specification version URI is http://mlcommons.org/croissant/1.0. The format also relies on the Schema.org (sc), Dublin Core Terms (dct), and Wikidata (wd) namespaces.
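The sketch below shows the two ideas in this paragraph: expanding a prefixed name such as cr:RecordSet against its namespace, and wrapping a JSON-LD document in the script element that Schema.org-style embedding uses. Only the cr namespace IRI comes from the text above; the IRIs given for sc, dct, and wd are the commonly used ones and should be checked against the specification's context. Both helper functions are hypothetical.

```python
import json

NAMESPACES = {
    "cr": "http://mlcommons.org/croissant/",
    # The three IRIs below are assumptions, not taken from the text above.
    "sc": "https://schema.org/",
    "dct": "http://purl.org/dc/terms/",
    "wd": "https://www.wikidata.org/wiki/",
}


def expand(compact: str) -> str:
    """Expand 'prefix:Term' to a full IRI; leave unknown prefixes untouched."""
    prefix, separator, local = compact.partition(":")
    if separator and prefix in NAMESPACES:
        return NAMESPACES[prefix] + local
    return compact


def to_script_tag(metadata: dict) -> str:
    """Wrap JSON-LD in a script element suitable for embedding in an HTML page."""
    # Escaping "</" keeps a string value from closing the script element early;
    # "<\/" is still valid JSON for the same characters.
    payload = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


print(expand("cr:RecordSet"))
print(expand("dct:conformsTo"))
print(to_script_tag({"@type": "sc:Dataset", "name": "toy_reviews"}))
```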
Governance and Maintenance
Croissant is maintained by MLCommons as an open standard under the Apache 2.0 license. The specification, tools, and example datasets are hosted on GitHub. A Python library (mlcroissant) provides a reference implementation for reading, validating, and loading Croissant-described datasets. A visual editor is available on Hugging Face Spaces for creating and editing Croissant metadata without writing JSON-LD by hand.
Notable Implementations
- Hugging Face generates Croissant metadata for hosted datasets
- Kaggle embeds Croissant metadata in dataset pages
- Google Dataset Search indexes Croissant files for discovery
- OpenML exports dataset metadata in Croissant format
- The mlcroissant Python library loads any Croissant-described dataset into PyTorch, TensorFlow, or JAX
Related Standards
- Schema.org provides the base vocabulary that Croissant extends with ML-specific properties
- Dublin Core Terms supplies the dct:conformsTo property used for spec version declaration
- DCAT provides catalog-level vocabulary for dataset discovery that complements Croissant's structural descriptions