Taxonomy Guide
Written by Brian Bushnell
Last updated August 15, 2017


BBTools contains a taxonomy package designed for processing taxonomy information, using taxonomy files from NCBI.
The related tools allow you to rename, sort, filter or bin annotated sequences by taxonomy, use Seal to classify read abundance at a specific taxonomic level, annotate Sketch files with taxonomic information, and so forth.


*Notes*


Acquiring taxonomy data:

There is now a shell script, bbmap/pipelines/fetchTaxonomy.sh, which automates fethching and formatting taxonomy information from NCBI.
The simple procedure is to run fetchtaxonomy.sh in some directory (say, /usr/tax/) and subsequently add the flag "taxpath=/usr/tax/" to programs that need taxonomy data such as gi2taxid.sh.
If fetchtaxonomy.sh completes successfully with no error messages, producing files like "tree.taxtree.gz", you can skip the rest of this "Acquiring taxonomy data" section.

The taxonomy files must be downloaded from NCBI.  The are currently available here:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz

Accession number information is at:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/*.gz

In Linux, it is most convenient to fetch them like this:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_*.dmp.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/*.gz

Before further use, these must be processed by gitable.sh and taxtree.sh.  taxdmp.zip needs to be unzipped first.  For more details see the Usage Examples section.


Acquiring sequence data:

NCBI is constantly rearranging its site, but currently, you can get sequence data here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/
And specifically, ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz contains all of the RefSeq bacterial data.  The current version should be available here: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*.fna*, though that's inconvenient as it's in many files.  There are also other clades in ftp://ftp.ncbi.nlm.nih.gov/refseq/release/


Sequence naming conventions:

The only requirement of sequence data is that each header uses one of the following formats:

1) It starts with an NCBI taxonomic identifier, in this format (where "123" is the taxID):
tid|123|other stuff
This is my preferred format since you don't have to look things up in huge tables, so it's the fastest and uses the least memory.

2) It starts with a gi number, in this format (where "123" is the gi number):
gi|123|other stuff
This is the old naming convention used by NCBI data.

3) It starts with an accession, in this format (where "AC1234.1" is the accession, followed by a space):
AC1234.1 other stuff
This is the current naming convention used by NCBI data.

4) It is a proper Genus_species pair, with whitespace optionally replaced by underscores, like this:
Homo_sapiens
This is not recommended because there could be a name collision, but in general, it should be fine.  It is not strictly necessary to replace whitespace with underscores, but that is often convenient.  For taxa above the species level, just a single taxonomic name is fine, such as "Gammaproteobacteria".  Sequence names are case-insensitive.
The tool will try to use best effort, so usually things like just the species name, or "family genus species" will work too, but it is unpredictable due to odd naming conventions for some organisms.

5) Silva data should have a mysterious string (like GCVF01000431.1.2369) that is ignored, followed by a space, then semicolon-delimited taxonomic information.  It only works in Silva mode.
GCVF01000431.1.2369 Bacteria;Proteobacteria;Gammaproteobacteria;Oceanospirillales;Alcanivoraceae;Alcanivorax;Thalassiosira rotula

6) IMG data should start with an IMG number followed by a space (this can be accomplished with renameimg.sh):
2547132486 EuzAGGAGA464_NODE_7_len_49232_cov_91_7699_ID_1693950.7 464_Euzebyaceae_AGG_AGA : EuzAGGAGA464_NODE_7_len_49232_cov_91_7699_ID_1693950.7
or
img|2547132486| other stuff
or
tid|123|img|2547132486| other stuff


Renaming and Sorting

The first goal in many cases is to change data from one of the random supported formats into format 1 so that sequences are clearly identified by their taxID.  After downloading and formatting taxonomy files, this can be done with gi2taxid.sh.
For example, nt can be renamed like this (where "/path/to/taxonomy" is wherever you ran fetchTaxonomy.sh):

gi2taxid.sh -Xmx63g in=nt.fa.gz out=renamed.fa.gz tree=auto accession=auto table=auto ow shrinknames taxpath=/path/to/taxonomy

Sorting is optional and puts related organisms nearby; this can be useful for conserving memory in some cases (like when making sketches).


Specifying processed taxonomy files:

Most taxonomy-aware tools (Seal, FilterByTaxa, BBSketch, etc) require the path to the tree and/or gitable and/or accessions and/or IMG translation tables to be specified in the command line.  The default locations of processed taxonomy files are hard-coded in TaxTree.DefaultTreeFile and DefaultTableFile, currently (on Genepool, JGI's cluster) at /global/projectb/sandbox/gaag/bbtools/tax/current.  So, on Genepool, you can use the flag "tree=auto" instead of the full path, and the default will be used.
For non-JGI users the full path can be used (e.g. "tree=/usr/tax/tree.taxtree.gz") or auto can be used while changing the taxpath (e.g. "tree=auto accession=auto taxpath=/usr/tax/").
Some tools default to "auto"; this can be changed.  For example, if a tool is automatically trying to load accessions (which use a ton of RAM and take several minutes) and you don't want it to, set "accession=null".


Memory and Threads:

Since NCBI's move to accessions, some Taxonomy-related tools use many threads and a lot of memory.  Loading the tax tree, git tables, and accessions will use over 40GB RAM and take up to 10 minutes.  However, if all sequences are annotated with the tax ID as in format 1 (above), only the tax tree needs to be loaded, which takes a couple seconds and a few 100 MB of RAM.


*Usage Examples*


Creating a TaxTree file:
taxtree.sh names.dmp nodes.dmp tree.taxtree.gz

names.dmp and nodes.dmp come from unzipping taxdmp.zip (see "Acquiring taxonomy data").  The TaxTree file is needed by every tool processing taxonomy.  Note that this command's flags are order-sensitive.


Creating a GiTable file:
gitable.sh gi_taxid_nucl.dmp.gz gitable.int1d.gz

or, if you want protein and nucleotide GI numbers:
gitable.sh gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz

gi_taxid_nucl.dmp.gz and gi_taxid_prot.dmp.gz come from NCBI (see "Acquiring taxonomy data").  The GiTable is needed by tools processing taxonomy, but only if reference sequences are named with gi numbers (e.g. gi|123|stuff).  Note that this command's flags are order-sensitive.


Renaming gi numbers to NCBI Tax ID numbers:
gi2taxid.sh in=bacteria.fa out=renamed.fa gi=gitable.int1d.gz

This is an optional step, but will rename sequences such as "gi|123|stuff" to "ncbi|456|stuff".  Replacing the gi number with a Tax ID number means in subsequent steps the gitable is no longer needed, which is more efficient.


Filtering sequences by taxonomy, according to sequence names:
filterbytaxa.sh in=bacteria.fa out=filtered.fa tree=tree.taxtree.gz table=gitable.int1d.gz names=Escherichia_coli level=phylum include=t

This will create a file, "filtered.fa", containing all the sequences in the same phylum as E.coli.  It is also possible to use numeric taxonomic IDs with the "ids" flag, or create a file containing everything except E.coli's phylum using the "include=f" flag.  Gitable is not needed unless the sequences are named with gi numbers.


Binning sequences by taxonomy, according to sequence names:
splitbytaxa.sh in=bacteria.fa out=%.fa tree=tree.taxtree.gz table=gitable.int1d.gz level=phylum

This will split the file into many output files, such as Deinococcus-Thermus.fa and Bacteroidetes.fa.  For taxonomic binning based on sequence content rather than names, see Seal.  Gitable is not needed unless the sequences are named with gi numbers.


Printing the taxonomic ancestry of a taxa:
taxonomy.sh tree=tree.taxtree.gz table=gitable.int1d.gz homo_sapiens meiothermus_ruber 123 gi_123

This will print the complete taxonomy of Homo sapiens, Meiothermus ruber, and the organism has a sequence with an NCBI identifier of 123 (Pirellula), and the organism with a gi number of 123 (Bos taurus).  Note that an underscore was used instead of the vertical line (gi_123) because vertical line is a reserved symbol for piping, so it's annoying to use on the command line.  Also note that "123" or "ncbi_123" indicate a taxonomy number, while "gi_123" indicates a gi number.  Gitable is not needed unless you enter gi numbers.


Running TaxServer:

This is currently a public HTTP service from JGI at https://taxonomy.jgi-psf.org.  But to run a private server:

taxserver.sh -Xmx45g tree=tree.taxtree.gz gi=gitable.int1d.gz accession=prot.accession2taxid.gz,nucl_wgs.accession2taxid.gz port=1234 1>log.txt 2>&1 &

That will run a background process waiting for http requests on port 1234.  On your computer you could view it like this:

http://localhost:1234/tax/

That would give usage information.  Queries are of the form:

http://localhost:1234/tax/accession/T02634.1
http://localhost:1234/tax/gi/1234
http://localhost:1234/tax/name/homo_sapiens

If you only need gi number and name lookups, the default 9 GB (or even 7GB) is plenty of RAM.  But for accession lookups, over 40GB is required.  You can include as few or as many accession2taxid tables as desired; fewer take less memory, and NCBI has around 9 of them currently.
