Cellosaurus is a comprehensive, freely accessible knowledge resource on cell lines maintained by the SIB Swiss Institute of Bioinformatics. It provides standardized nomenclature, detailed provenance information, and extensive cross-referencing for more than 150,000 human and animal cell lines used in biomedical research. Researchers use it to identify, authenticate, and properly cite cell lines in scientific publications.
Background
Cell lines are fundamental tools in biological and medical research, yet their management has long been plagued by misidentification, contamination, and inconsistent naming. Hundreds of research papers have been retracted or questioned because they relied on misidentified cell lines. The Cellosaurus project was initiated to address these problems by creating a single authoritative resource that aims to document every known cell line with its correct identity, origin, and history. Developed within the framework of the neXtProt project and hosted on the ExPASy bioinformatics resource portal, Cellosaurus has grown from a focused reference list into a comprehensive knowledge base covering over 150,000 cell lines.
Purpose & Scope
Cellosaurus catalogs cell lines from a wide range of species and tissue types, with particularly deep coverage of human cell lines used in cancer research, immunology, and drug development. For each cell line entry, the resource provides:
- A unique accession number (CVCL identifier) for unambiguous reference
- Recommended name and synonyms
- Species of origin and disease association
- Cross-references to over 100 external databases and resources
- Literature references
- Provenance and authentication data, including STR profiles
- Information on known problematic cell lines (contaminated, misidentified)
Entries for cell lines known to be contaminated or misidentified carry explicit warnings, so researchers can check a line's status before using it and avoid compromised materials.
Data Model
| Field | Description |
|---|---|
| Accession | Unique CVCL identifier (e.g., CVCL_0030 for HeLa) |
| Name | Recommended cell line name |
| Synonyms | Alternative names and identifiers |
| Species | Species of origin |
| Disease | Associated disease or condition |
| Cross-references | Links to external databases (ATCC, DSMZ, RIKEN, etc.) |
| STR Profile | Short tandem repeat authentication data |
| Comments | Provenance, contamination warnings, and other notes |
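The fields in the table can be mirrored in a small in-memory record type. The following is a minimal sketch, not the resource's own schema: the class name `CellLineRecord`, its attribute names, and the `matches` helper are all hypothetical and chosen only to follow the table row by row.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CellLineRecord:
    """Hypothetical container mirroring the data model table."""

    accession: str                  # CVCL identifier, e.g. "CVCL_0030"
    name: str                       # recommended cell line name
    synonyms: List[str] = field(default_factory=list)
    species: Optional[str] = None   # species of origin
    disease: Optional[str] = None   # associated disease or condition
    # Cross-references keyed by external database name.
    cross_references: Dict[str, List[str]] = field(default_factory=dict)
    # STR profile keyed by marker name, with allele values kept as strings.
    str_profile: Dict[str, List[str]] = field(default_factory=dict)
    # Free-text notes: provenance, contamination warnings, and so on.
    comments: List[str] = field(default_factory=list)

    def matches(self, query: str) -> bool:
        """Case-insensitive match against the recommended name and synonyms."""
        wanted = query.strip().lower()
        return any(wanted == n.lower() for n in [self.name, *self.synonyms])


# Only the accession, name, and species are filled in here; the other
# fields are left empty rather than populated with invented values.
hela = CellLineRecord(accession="CVCL_0030", name="HeLa", species="Homo sapiens")
```

Keeping synonyms alongside the recommended name in one lookup reflects the naming problem the resource targets: a query by any known name should resolve to the same accession.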
Serializations & Technical Formats
Cellosaurus data is available in multiple formats for computational use. The primary distribution formats are an OBO flat file and an XML export, both available via FTP from the ExPASy server. A web-based search interface supports interactive lookup, and a REST API provides programmatic access to individual records. The data is also represented in RDF for integration with Semantic Web resources.
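To illustrate how an OBO-style distribution file might be consumed, here is a minimal sketch of a stanza parser. It assumes only the general OBO conventions (a `[Term]` stanza header followed by `tag: value` lines, with `!` starting a comment line). The sample text is an illustrative placeholder, not an excerpt from the actual Cellosaurus file, and `parse_obo_terms` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative input only: tag names follow general OBO conventions and the
# synonym, xref, and second stanza are placeholders, not real Cellosaurus data.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: CVCL_0030
name: HeLa
synonym: "EXAMPLE-SYNONYM" RELATED []
xref: EXAMPLE_DB:12345

[Term]
id: CVCL_XXXX
name: Placeholder line
"""


def parse_obo_terms(text: str) -> List[Dict[str, List[str]]]:
    """Collect each [Term] stanza as a mapping from tag to list of values."""
    terms: List[Dict[str, List[str]]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comment lines
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = defaultdict(list) if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if current is None:
            continue  # file header lines or lines in a non-Term stanza
        # Split on the first colon only, so values may contain colons.
        tag, sep, value = line.partition(":")
        if sep:
            current[tag.strip()].append(value.strip())
    return [dict(term) for term in terms]


terms = parse_obo_terms(SAMPLE_OBO)
```

Tags are stored as lists because a stanza can repeat a tag, as is typical for synonyms and cross-references. A production reader would more likely rely on an established OBO parsing library than on this hand-rolled loop.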
Governance & Maintenance
Cellosaurus is maintained by the CALIPHO group at the SIB Swiss Institute of Bioinformatics, led by Amos Bairoch. The resource is updated regularly with new cell line entries, corrections, and expanded cross-references. It is released under a Creative Commons Attribution 4.0 license, allowing free reuse with attribution. Major cell line repositories such as ATCC, DSMZ, JCRB, and RIKEN collaborate by providing data feeds.
Notable Implementations
Cellosaurus accession numbers are increasingly used as standard identifiers in scientific journals. Several publishers and funding agencies recommend citing cell lines using Cellosaurus identifiers. The resource is cross-referenced by major biomedical databases including UniProt, ChEMBL, and the Catalogue of Somatic Mutations in Cancer (COSMIC). It is also registered in FAIRsharing as a recognized bioinformatics resource.
Related Standards
- OBO Foundry -- Cellosaurus uses OBO format for one of its distribution files
- MIRIAM/Identifiers.org -- Cellosaurus accession numbers are registered as a MIRIAM data type
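
Registration with Identifiers.org means an accession can be written as a compact, resolvable identifier. The sketch below shows one way this might look, under two assumptions that should be checked against the registry: that the namespace prefix is `cellosaurus`, and that accessions consist of `CVCL_` followed by four uppercase letters or digits (inferred from examples such as CVCL_0030). The function names are hypothetical.

```python
import re

# Assumed accession shape, inferred from examples such as CVCL_0030.
ACCESSION_RE = re.compile(r"CVCL_[A-Z0-9]{4}")


def is_valid_accession(accession: str) -> bool:
    """Return True if the whole string matches the assumed accession shape."""
    return ACCESSION_RE.fullmatch(accession) is not None


def identifiers_org_url(accession: str, prefix: str = "cellosaurus") -> str:
    """Build a compact-identifier URL; the default prefix is an assumption."""
    if not is_valid_accession(accession):
        raise ValueError(f"not a recognized accession: {accession!r}")
    return f"https://identifiers.org/{prefix}:{accession}"


url = identifiers_org_url("CVCL_0030")
```

Validating before building the URL keeps malformed accessions, such as a lowercase prefix or a missing underscore, from silently turning into dead links in a manuscript or database record.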