Croissant Format Specification

A metadata format for machine learning datasets developed by MLCommons that standardizes how data is described, discovered, and loaded across ML frameworks. Built as a vocabulary on top of Schema.org, Croissant defines three layers of description: dataset-level metadata (name, license, creators), resource-level organization (files and file sets), and record-level structure (fields, data types, and semantic mappings). Encoded in JSON-LD, Croissant metadata enables dataset search engines to index datasets and ML tools to load them programmatically, addressing discoverability, portability, reproducibility, and responsible AI challenges.

Overview

Croissant is a metadata format for machine learning datasets that bridges the gap between how datasets are stored and how ML tools need to consume them. Developed by MLCommons and published as version 1.0 in March 2024, it extends Schema.org's Dataset vocabulary with ML-specific constructs for describing file organization, data structure, and semantic types. Major dataset platforms -- Hugging Face, Kaggle, Google Dataset Search, and OpenML -- adopted Croissant at launch, making it a de facto interoperability layer for ML dataset metadata.

Background

The machine learning community has long struggled with dataset interoperability. Datasets hosted across different platforms used incompatible metadata formats, and researchers routinely spent more time on data preparation than on model development. MLCommons, the open engineering consortium behind the MLPerf benchmark suite, convened a working group in 2023 to develop a common metadata format. The group brought together researchers and engineers from Google, Meta, King's College London, OpenML, and other institutions. The resulting specification was authored by Omar Benjelloun, Elena Simperl, Pierre Marcenac, Pierre Ruyssen, and others.

Purpose and Scope

Croissant pursues four goals in the ML dataset ecosystem:

  • Discoverability -- Croissant metadata enables dataset search engines to parse and index datasets regardless of where they are published.
  • Portability and Reproducibility -- The format provides enough information for ML tools to load a dataset with a few lines of code, and because the format is standardized, Croissant-compliant tools can interpret the same dataset consistently.
  • Responsible AI -- The specification includes a modular RAI (Responsible AI) extension vocabulary that captures dataset provenance, labeling processes, and safety-relevant metadata.
  • Extensibility -- Croissant is designed for community extensions addressing specific data modalities (audio, video) and domains (geospatial, life sciences, cultural heritage).

Key Components

The specification defines a three-layer architecture:

Layer | Description | Key Classes
Dataset-level | General metadata: name, description, license, creators, URL, citation | Schema.org Dataset properties
Resources | Physical file organization: individual files and file collections | FileObject, FileSet
RecordSets | Logical data structure: fields, types, joins, splits | RecordSet, Field, DataSource
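As a rough illustration of how the three layers sit together in one JSON-LD document, the sketch below builds a toy description as a Python dictionary. The dataset name, URLs, creator, and checksum are hypothetical placeholders, and the @context is abbreviated to the terms used here (an assumption; the specification publishes the full context).

```python
import json

# Hypothetical toy dataset: names, URLs, and the checksum are placeholders.
# The @context is abbreviated (an assumption); see the specification for
# the complete context.
metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "dct": "http://purl.org/dc/terms/",
        "recordSet": "cr:recordSet",
        "field": "cr:field",
        "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
        "source": "cr:source",
        "fileObject": "cr:fileObject",
        "extract": "cr:extract",
        "column": "cr:column",
    },
    # Layer 1: dataset-level metadata (Schema.org Dataset properties).
    "@type": "sc:Dataset",
    "dct:conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "toy_reviews",
    "description": "A hypothetical table of short product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_reviews",
    "creator": {"@type": "sc:Organization", "name": "Example Lab"},
    "datePublished": "2024-03-06",
    # Layer 2: resources, i.e. the physical files.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "name": "reviews.csv",
            "contentUrl": "https://example.org/data/reviews.csv",
            "encodingFormat": "text/csv",
            "sha256": "0" * 64,  # placeholder, not a real checksum
        }
    ],
    # Layer 3: record sets, i.e. the logical structure over the files.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "name": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "name": "text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                }
            ],
        }
    ],
}


def summarize_layers(md: dict) -> dict:
    """Return a compact view of the three layers of a description."""
    return {
        "dataset": md["name"],
        "resources": [res["@id"] for res in md["distribution"]],
        "record_sets": {
            rs["@id"]: [f["name"] for f in rs["field"]] for rs in md["recordSet"]
        },
    }


print(json.dumps(summarize_layers(metadata), indent=2))
```

Each field points back to a resource through its source, which is how the logical layer stays decoupled from the physical file layout.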

Required Dataset Properties

The specification mandates @context, @type, dct:conformsTo, name, description, license, url, creator, and datePublished at the dataset level. Recommended properties include keywords, publisher, version, dateCreated, dateModified, and inLanguage.
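A minimal sketch of a completeness check over these property lists follows. It is not the validator that ships with any Croissant tool; it only looks for the exact keys named above in an already parsed JSON-LD dictionary, and the helper name `missing_properties` and the draft dataset are hypothetical.

```python
REQUIRED = (
    "@context", "@type", "dct:conformsTo", "name", "description",
    "license", "url", "creator", "datePublished",
)
RECOMMENDED = (
    "keywords", "publisher", "version", "dateCreated", "dateModified",
    "inLanguage",
)


def missing_properties(metadata: dict, names: tuple = REQUIRED) -> list:
    """List the property names that are absent or empty, in spec order."""
    empty_values = (None, "", [], {})
    return [
        name for name in names
        if name not in metadata or metadata[name] in empty_values
    ]


# Hypothetical draft that still lacks several mandatory properties.
draft = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "A hypothetical table of short product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print("missing required:", missing_properties(draft))
print("missing recommended:", missing_properties(draft, RECOMMENDED))
```

A key-level check like this is a simplification: a file whose context aliases a property under a different key would need JSON-LD expansion before the comparison.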

Data Types

Croissant supports atomic types (Boolean, Date, Float, Integer, Text) and semantic types (ImageObject, BoundingBox, Split). Types from external vocabularies such as Wikidata can also be used, enabling domain-specific semantics.
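To make the atomic types concrete, the sketch below casts raw string values according to a declared data type. It assumes the atomic types are written with the Schema.org prefix (`sc:Boolean`, `sc:Date`, and so on) and that dates use ISO 8601; the parser table and `cast_value` are illustrative helpers, not the behavior of any particular Croissant loader.

```python
from datetime import date


def _parse_bool(raw: str) -> bool:
    """Accept a small, explicit set of boolean spellings."""
    value = raw.strip().lower()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {raw!r}")


# Assumed mapping from atomic type names to Python parsers.
PARSERS = {
    "sc:Boolean": _parse_bool,
    "sc:Date": date.fromisoformat,
    "sc:Float": float,
    "sc:Integer": int,
    "sc:Text": str,
}


def cast_value(raw: str, data_type: str):
    """Convert a raw string to the Python value implied by data_type."""
    try:
        parser = PARSERS[data_type]
    except KeyError:
        raise ValueError(f"no parser for data type {data_type!r}") from None
    return parser(raw)


print(cast_value("42", "sc:Integer"))
print(cast_value("2024-03-06", "sc:Date"))
```

Semantic types such as ImageObject or BoundingBox describe what a value means rather than how to parse a string, so they are left out of this table.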

Versioning

The specification adopts semantic versioning (MAJOR.MINOR.PATCH) for datasets and provides guidance on live datasets with continuously updated data, including checksum management for evolving files.
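The two mechanics mentioned here, version ordering and file checksums, can be sketched with the standard library. The helpers below are hypothetical; the version parser handles only the plain MAJOR.MINOR.PATCH form and ignores pre-release or build suffixes, and SHA-256 is assumed as the checksum algorithm.

```python
import hashlib
import re

_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_version(version: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints that sorts correctly."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def checksum_matches(data: bytes, expected_sha256: str) -> bool:
    """Compare the SHA-256 digest of data with a recorded hex checksum."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


# Integer tuples order versions numerically, so 1.10.0 is newer than 1.9.9.
print(parse_version("1.10.0") > parse_version("1.9.9"))
```

Comparing parsed tuples instead of raw strings avoids the lexicographic trap where "1.10.0" would sort before "1.9.9". For a live dataset, a checksum recorded at publication time will stop matching once the file changes, which is why the specification gives separate guidance for evolving files.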

Serializations and Technical Formats

Croissant metadata is encoded in JSON-LD and can be embedded in web pages, following the usual Schema.org pattern. The vocabulary namespace is http://mlcommons.org/croissant/ (abbreviated cr), and the specification version URI is http://mlcommons.org/croissant/1.0. The format also relies on the Schema.org (sc), Dublin Core Terms (dct), and Wikidata (wd) namespaces.
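The sketch below shows how the prefixes expand to full IRIs and how a JSON-LD document is typically wrapped for embedding in a page. Only the cr namespace is taken from the text above; the IRIs used for sc, dct, and wd are assumptions based on common usage, and both helper functions are hypothetical.

```python
import json

PREFIXES = {
    "cr": "http://mlcommons.org/croissant/",
    "sc": "https://schema.org/",             # assumed IRI
    "dct": "http://purl.org/dc/terms/",      # assumed IRI
    "wd": "https://www.wikidata.org/wiki/",  # assumed IRI
}


def expand(term: str) -> str:
    """Expand a compact IRI such as 'cr:RecordSet'; leave other terms as-is."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term


def embed_as_jsonld(metadata: dict) -> str:
    """Wrap metadata in the script tag used for JSON-LD in HTML pages."""
    body = json.dumps(metadata, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(expand("cr:RecordSet"))
print(embed_as_jsonld({"@type": "sc:Dataset", "name": "toy_reviews"}))
```

A full JSON-LD processor does more than prefix substitution (for example, it honors @vocab and term definitions in the context), so `expand` is only a reading aid.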

Governance and Maintenance

Croissant is maintained by MLCommons as an open standard under the Apache 2.0 license. The specification, tools, and example datasets are hosted on GitHub. A Python library (mlcroissant) provides a reference implementation for reading, validating, and loading Croissant-described datasets. A visual editor is available on Hugging Face Spaces for creating and editing Croissant metadata without writing JSON-LD by hand.
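Loading records with the reference library might look like the sketch below. It assumes the third-party `mlcroissant` package is installed and that its `Dataset(jsonld=...)` constructor and `records(record_set=...)` method behave as in current releases; the wrapper function and the URL in the usage comment are hypothetical.

```python
from itertools import islice


def first_records(jsonld: str, record_set: str, limit: int = 5) -> list:
    """Return up to `limit` records from one record set of a dataset.

    `jsonld` is a path or URL to a Croissant JSON-LD file.
    """
    # Imported lazily so the sketch can be read and imported without the
    # third-party dependency (install with: pip install mlcroissant).
    import mlcroissant as mlc

    dataset = mlc.Dataset(jsonld=jsonld)
    return list(islice(dataset.records(record_set=record_set), limit))


# Usage with a placeholder URL and record set name:
# for record in first_records("https://example.org/croissant.json", "default"):
#     print(record)
```

Using `islice` keeps the example cheap on large datasets, since records are produced lazily and only the first few are materialized.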

Notable Implementations

  • Hugging Face generates Croissant metadata for hosted datasets
  • Kaggle embeds Croissant metadata in dataset pages
  • Google Dataset Search indexes Croissant files for discovery
  • OpenML exports dataset metadata in Croissant format
  • The mlcroissant Python library loads Croissant-described datasets for use with frameworks such as PyTorch, TensorFlow, and JAX

Related Standards

  • Schema.org provides the base vocabulary that Croissant extends with ML-specific properties
  • Dublin Core Terms supplies the dct:conformsTo property used for spec version declaration
  • DCAT provides catalog-level vocabulary for dataset discovery that complements Croissant's structural descriptions

Further Reading