The NCBI Taxonomy Database is the standard classification and nomenclature used across the sequence databases maintained by the National Center for Biotechnology Information (NCBI). It provides a single, curated hierarchy of organism names and classifications that underpins GenBank, RefSeq, and other NCBI resources, making it one of the most widely referenced biological vocabularies.
Background
The NCBI Taxonomy Database was developed in the early 1990s as part of NCBI's mission to provide integrated access to molecular biology information. As nucleotide and protein sequence databases grew, there was a critical need for a consistent taxonomic framework to organize sequences by organism. Rather than adopting any single existing taxonomic authority, the NCBI Taxonomy group curates a synthetic classification that draws on published taxonomic literature while maintaining internal consistency across all NCBI databases.
The database is continuously updated as new organisms are sequenced and as taxonomic revisions are published. It currently contains over 2.5 million taxa, representing approximately 10% of the described species of life on the planet.
Purpose & Scope
The NCBI Taxonomy serves as the standard reference for organism classification within the INSDC (International Nucleotide Sequence Database Collaboration), which includes GenBank (USA), the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ). Every sequence submitted to these databases must be associated with a valid NCBI Taxonomy identifier (TaxID).
The taxonomy covers all domains of life (Bacteria, Archaea, and Eukaryota) as well as viruses and unclassified sequences. It includes scientific names, common names, synonyms, and a hierarchical classification from domain (the rank labeled superkingdom in older releases) down to subspecies and strain level where applicable.
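The hierarchy can be pictured as a set of parent pointers, one per TaxID, that are followed upward until the root is reached. The sketch below is only an illustration under stated assumptions: `TOY_NODES` is a hand-written, abridged excerpt (many intermediate nodes of the real human lineage are left out, so the parent links skip levels), and the `lineage` helper is a hypothetical name, not part of any NCBI tool.

```python
# Abridged, hand-written excerpt: TaxID -> (parent TaxID, rank, name).
# Assumption: intermediate nodes of the real lineage are omitted, so these
# parent links are simplified and do not match the full database.
TOY_NODES = {
    9606: (9605, "species", "Homo sapiens"),
    9605: (9604, "genus", "Homo"),
    9604: (9443, "family", "Hominidae"),
    9443: (40674, "order", "Primates"),
    40674: (7711, "class", "Mammalia"),
    7711: (33208, "phylum", "Chordata"),
    33208: (2759, "kingdom", "Metazoa"),
    2759: (1, "domain", "Eukaryota"),  # labeled "superkingdom" in older dumps
    1: (1, "no rank", "root"),         # the root node is its own parent
}


def lineage(tax_id, nodes):
    """Return the path from the root down to tax_id as (TaxID, rank, name)."""
    path = []
    seen = set()  # guards against accidental cycles in malformed data
    current = tax_id
    while current not in seen:
        seen.add(current)
        parent, rank, name = nodes[current]  # KeyError for an unknown TaxID
        path.append((current, rank, name))
        if parent == current:  # reached the root
            break
        current = parent
    path.reverse()
    return path


if __name__ == "__main__":
    for tid, rank, name in lineage(9606, TOY_NODES):
        print(f"{tid:>6}  {rank:<8}  {name}")
```

Storing only the parent of each node keeps the structure compact. The trade-off is that reconstructing a full lineage takes one lookup per level.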
Key Features
| Feature | Description |
|---|---|
| Taxa | 2.5 million+ |
| Coverage | All organisms in INSDC sequence databases |
| Identifiers | Numeric TaxID (e.g., 9606 for Homo sapiens) |
| Updates | Continuous |
| Ranks | Domain through subspecies/strain |
Serializations & Technical Formats
The NCBI Taxonomy is available for bulk download from the NCBI FTP server in a flat-file dump format (taxdump). Individual records can be retrieved through the NCBI Taxonomy Browser web interface or programmatically via the Entrez E-Utilities API. The data is also integrated into the NCBI Datasets resource.
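As a rough sketch of both access routes, the example below parses lines in the taxdump layout and builds an E-Utilities `efetch` URL for the taxonomy database. It assumes the usual dump conventions: fields separated by a tab, a vertical bar, and a tab, with each line ending in a tab and a vertical bar; `nodes.dmp` starting with TaxID, parent TaxID, and rank; and `names.dmp` holding TaxID, name, unique name, and name class. The function names are hypothetical, and the sample lines are illustrative and truncated to the first few columns rather than copied from a real download.

```python
from urllib.parse import urlencode

EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):  # drop the end-of-line terminator
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(lines):
    """Map TaxID -> (parent TaxID, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        if not line.strip():
            continue
        fields = parse_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def load_scientific_names(lines):
    """Map TaxID -> scientific name from names.dmp-style lines."""
    names = {}
    for line in lines:
        if not line.strip():
            continue
        tax_id, name, _unique_name, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


def efetch_taxonomy_url(tax_ids):
    """Build (but do not send) an efetch request URL for the given TaxIDs."""
    query = urlencode({
        "db": "taxonomy",
        "id": ",".join(str(t) for t in tax_ids),
        "retmode": "xml",
    })
    return f"{EUTILS_EFETCH}?{query}"


# Illustrative sample lines (assumption: real nodes.dmp lines carry more columns).
SAMPLE_NODES = [
    "9606\t|\t9605\t|\tspecies\t|\n",
    "9605\t|\t207598\t|\tgenus\t|\n",
]
SAMPLE_NAMES = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n",
]

if __name__ == "__main__":
    print(load_nodes(SAMPLE_NODES))
    print(load_scientific_names(SAMPLE_NAMES))
    print(efetch_taxonomy_url([9606]))
```

For whole-taxonomy work the bulk dump is usually the better fit, since it avoids one network request per record. The E-Utilities route suits occasional lookups of a few TaxIDs.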
Governance & Maintenance
The NCBI Taxonomy is maintained by a dedicated curation team at the National Center for Biotechnology Information, part of the National Library of Medicine (NLM) at the US National Institutes of Health (NIH). Taxonomic updates are informed by published literature, submissions from sequence depositors, and consultation with domain experts. As a US government resource, the data is in the public domain.
Notable Implementations
The NCBI TaxID is used as the organism identifier in GenBank, RefSeq, UniProt, the Protein Data Bank, and many other biological databases. It serves as the de facto standard for linking molecular data to organism identity in bioinformatics pipelines worldwide. The Common Tree tool lets users generate a tree showing the taxonomic relationships among a selected set of taxa; the tree reflects the NCBI classification rather than an inferred phylogeny.
Related Standards
- Darwin Core — biodiversity data standard that references taxonomic authorities including NCBI