Processing recommendations¶
Currently, Qiita supports the processing of raw data from:
Target gene barcoded sequencing
Shotgun sequencing
Metatranscriptome sequencing
Genome isolate sequencing
Note that the selected processing recommendations are mainly guided towards performing meta-analyses, this is combine different studies, even from different wet lab techniques or sequencing technologies. However, these parameters shouldn’t prevent you using the resulting tables as your primary analytical source.
Target gene barcoded sequencing¶
For this you can start with raw, pre-demultiplexing data or per-sample FASTQ. Either way, you will need to “Split libraries and QC”, which uses the default in QIIME 1.9.1. Once your demultiplexed and QCed artifact is created you need to select which processing to perform. There are two main ideologies/methodologies to process target gene data: sequence clustering and sequence deblur.
Sequencing deblur (preferred)¶
For this we use deblur. Here 2 BIOM tables are generated by default: deblur final table and deblur reference hit table. The former is the full biom table, which can be used with any target gene and wetlab work; the latter is the trimmed version to those sequences that match Greengenes at 80% similarity, a really basic and naive filtering. Each of those BIOM tables, is accompanied by a FASTA that contains the representative sequences. The OTU IDs are given by the unique sequence.
Note that deblur needs all sequences to be trimmed at the same length, thus the recommended pipeline is to trim everything at 150bp and the deblur.
Sequencing clustering¶
Here we use close reference picking, for an explanation of the different picking methods see “Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences”. Here we generate a single BIOM table with the OTUs/per-sample. The OTU IDs are given based on the reference database selected.
Currently, we have the reference databases: Greengenes version 3_8-97, Silva 119 and Unite 7. Depending on your selection is if the reference has a phylogenetic tree.
The Sequencing Processing Pipeline (SPP)¶
The Knight Lab Sequencing Processing Pipeline is provided by the qp-klp plugin and it takes the internal BCL files, converts them to per-sample-FASTQ and does the QC and human-host sequencing filtering before the files are linked into the preparation. In other words, this is the Knight Lab pre-processing and automatic loading Metagenomic pipeline. Note that this is only available to Admins and only to process internally sequenced data.
Historically the Knight Lab have used these filtering steps:
atropos & bowtie2 against GRCh38.p7 + Phi X 174
fastp & minimap2 against GRCh38.p7 + Phi X 174
fastp & minimap2 against GRCh38.p7 + Phi X 174 via SPP
fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0
fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0 via SPP
fastp & minimap2 run in paired end and single end mode against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023
fastp & minimap2 run in paired end and single end mode against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023 via SPP
fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0, then Movi against GRCh38.p14, T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023
fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0, then Movi against GRCh38.p14, T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023 via SPP
As part of the 2024.07 release, for easier tracking, we have added this information to the database and linked to the artifacts and preparations. The goal for the near future is to use the latest filtering on all the available data.
References:
Shotgun sequencing¶
Qiita currently has one active shotgun metagenomics data analysis pipeline: a per sample, paired-end bowtie2 alignment step with Woltka classification using either the WoLr2 (default) or RS210 databases. Below you will find more information about each of these options.
Note
The bowtie2 settings are set for interleaved processing with a maximum and minimum mismatch penalties (mp=[1,1]), a penalty for ambiguities (np=1; default), read and reference gap open and extend penalties (rdg=[0,1], rfg=[0,1]), a minimum alignment score for an alignment to be considered valid (score-min=[L,0,-0.05]), a defined number of distinct, valid alignments (k=16), and the suppression of SAM records for unaligned reads, as well as SAM headers (no-unal, no-hd), and using end-to-end alignments before using the multiseed heuristic (no-exact-upfront, no-1mm-upfront). More information visit:
The current workflow is as follows:
A single step per sample adapter removal (via fastp) and host filtering (via minimap2); more information below.
Taxonomy profiling using bowtie2 as an aligner and two different reference databases; see sections below
Note that we recommend only uploading sequences that have already been through QC and human sequence removal. However, we recommend that all sequence files go through adapter and host filtering within the system to ensure they are ready for subsequent meta-analyses. We currently provide the several options for your convenience. For each the fastp command is set to autodetect and remove universal adapter sequences (i.e., ‘GATCGGAAGAGCACACGTCTGAACTCCAGTCAC’ for R1 reads and ‘GATCGGAAGAGCGTCGTGTAGGGAAAGGAGTGT’ for R2 reads). We also provide the following host reference genomes for filtering against; each also filters against three phi x sequences (i.e., HM753704.1, JF719728.1, J02482.1):
auto-detect adapters and artifacts + phix filtering: This is a deblur artifacts reference, mainly for debugging and testing. Includes another adapter sequence (i.e., ‘ATCTCGTATGCCGTCTTCTGC’).
auto-detect adapters and cheetah + phix filtering. Includes cheetah (Acinonyx jubatus) reference GCF_003709585.1 (Aci_jub_2). GCF_003709585.1 fna
auto-detect adapters and cow + phix filtering. Includes cow (Bos taurus) reference GCF_000003205.7 (Btau_5.0.1). GCF_000003205.7 fna
auto-detect adapters and hamster + phix filtering. Includes golden hamster (Mesocricetus auratus) reference GCF_017639785.1 (BCM_Maur_2.0). GCF_017639785.1 fna
auto-detect adapters and horse + phix filtering. Includes horse (Equus caballus) reference GCF_000002305.2 (EquCab2.0). GCF_000002305.2 fna
auto-detect adapters and merge_genomes + phix filtering. Includes the genomes of cheetah, cow, hamster, horse, mouse, pig, rabbit, and rat described here.
auto-detect adapters and mouse + phix filtering. Includes house mouse (Mus musculus) reference GCF_000001635.27 (GRCm39). GCF_000001635.27 fna
auto-detect adapters and pig + phix filtering. Includes pig (Sus scrofa) reference GCF_000003025.6 (Sscrofa11.1). GCF_000003025.6 fna
auto-detect adapters and rabbit + phix filtering. Includes rabbit (Oryctolagus cuniculus) reference GCF_000003625.3 (OryCun2.0). GCF_000003625.3 fna
auto-detect adapters and rat + phix filtering. Includes Norway rat (Rattus norvegicus) reference GCF_000001895.5 (Rnor_6.0). GCF_000001895.5 fna
auto-detect adapters only filtering. Only includes the two adapter sequences noted above.
For more information about the versions in this plugin, visit:
Note that the command produces up to 5 output artifacts based on the aligner and database selected:
Per genome Predictions: contains the raw alignment file and the per genome level predictions BIOM table
Per gene Predictions: Only WoLr2, contains the per gene level predictions BIOM table
KEGG Pathways: Only WoLr2, contains the functional profile
KEGG Ontology (KO): Only WoLr2, contains the functional profile
KEGG Enzyme (EZ): Only WoLr2, contains the functional profile
Note
Woltka 0.1.6 only produces per-genome, per-gene and functional profiles as we are moving to Operational Genomic Units (OGUs), which have higher resolution than taxonomic units for community ecology, and were shown to deliver stronger biological signals in downstream analyses. For more information please read: Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy. To work on lower taxonomic levels (like species or genus) you can follow these instructions and use this lineages.txt file with your collapse command.
Aligners¶
Note that some of these are legacy option but not available for new processing.
Bowtie2: The classical ultrafast short sequence aligner. Based on-FM indexing of genome sequences to achieve efficient memory and CPU performance. We tuned the parameter setting for Bowtie2 to achieve optimal alignment accuracy for typical shotgun metagenome datasets.
Version: 2.4.2
Alignment file format: SAM
Website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Citation: Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.
BURST: A novel sequence aligner featuring exhaustive (in contrast to heuristic) alignment against the entire reference genome database, hence achieving the highest accuracy, though with relatively long running time.
Version: 0.99.8
Alignment file format: b6o (BLAST tabular output, i.e., -outfmt 6)
Website: https://github.com/knights-lab/BURST
Citation: Gabriel Al-Ghalith and Dan Knights. BURST enables optimal exhaustive DNA alignment for big data. DOI 2017:doi.org/10.5281/zenodo.806850
Note: Manuscript under review.
UTree A sequence classifier based on _k_-mer signature matching along a tree structure. Analogous to Kraken but with higher computational efficiency. The fastest option.
Version: 2.0 RF
Alignment file format: custom mapping file
Website: https://github.com/knights-lab/UTree
Citation: Gabriel Al-Ghalith and Dan Knights. Faster and lower-memory metagenomic profiling with UTree. DOI: 10.5281/zenodo.998252
Reference databases¶
Note that some of these are legacy option but not available for new processing.
WoLr2 (“Web of Life” release 2): A significant upgrade from WoLr1. The genome pool is an even representation of microbial diversity, sampled from non-redundant bacterial and archaeal genomes from NCBI (RefSeq and GenBank, complete and draft). A high-quality reference phylogeny was reconstructed using the uDance workflow (manuscript in submission). Taxonomic classifications were curated according to phylogeny based on GTDB (default) and NCBI. Functional annotations were performed using EggNOG, GO, KEGG, MetaCyc, Pfam and UniRef.
Domains: Bacteria, Archaea
Number of genomes: 15,953
Total length (bp): 48,809,171,826
Citation: Zhu Q, Mai U, Pfeiffer W, et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat Commun. 2019. 10(1):5477. doi: 10.1038/s41467-019-13443-4.
Numbers of taxonomic units:
Domains: 2
Phyla: 124
Classes: 321
Orders: 914
Families: 2,057
Genera: 6,811
Species: 12,258
RS210: Collection of reference microbial genomes sampled from the NCBI RefSeq genome database, as of 2022-01-01. This time point corresponds to RefSeq release 210.
Genomes: 29,648
Nucleotides: 926,894
Basepairs: 111,767,286,504 (includes linkers)
Numbers of taxonomic units:
Archaea: 606
Bacteria: 21,047
Fungi: 409
Protozoa: 93
Viral: 7,493
WoLr1 (“Web of Life” release 1): An even representation of microbial diversity, selected using an prototype selection algorithm based on the MinHash distance matrix among all non-redundant bacterial and archaeal genomes from NCBI (RefSeq and GenBank, complete and draft), plus several genome quality control criteria. A high-quality reference phylogeny is available for this genome pool, enabling subsequent phylogeny-based analyses. Also available are curated taxonomic annotations, based on NCBI and GTDB systems.
Domains: Bacteria, Archaea
Number of genomes: 10,575
Total length (bp): 32,861,886,373
Citation: Zhu Q, Mai U, Pfeiffer W, et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat Commun. 2019. 10(1):5477. doi: 10.1038/s41467-019-13443-4.
Numbers of taxonomic units:
Kingdoms: 2
Phyla: 146
Classes: 89
Orders: 196
Families: 422
Genera: 2,081
Species: 9,105
Strains: 89
Note: Nucleotide sequences per genome were concatenated with a linker of 20 “N”s.
Rep200: NCBI representative and reference microbial genomes, corresponding to RefSeq release 200 (2020-05-14)
Genomes: 11,955
Nucleotides: 926,894
Basepairs: 62,823,581,921 (excluding gaps)
Numbers of taxonomic units:
Archaea: 419
Bacteria: 11080
Fungi: 320
Protozoa: 88
Viral: 48
Rep94: NCBI representative and reference microbial genomes, corresponding to RefSeq release 94.
Domains: Bacteria, Archaea
Number of genomes: 5,808
Total length (bp): 23,165,526,011
Note: Nucleotide sequences per genome were concatenated with a linker of 20 “N”s.
Numbers of taxonomic units:
Kingdoms: 2
Phyla: 38
Classes: 85
Orders: 186
Families: 427
Genera: 1,931
Species: 5,636
Strains: 84
Rep82: NCBI representative and reference microbial genomes, corresponding to RefSeq release 82.
Not available anymore for new processing
Domains: Bacteria, Archaea, Viruses/Viroids
Number of genomes: 10,519
Total length (bp): 20,387,349,319
Note: Plasmids were isolated from bacterial and archaeal host genomes and considered as separate genomes.
Numbers of taxonomic units:
Kingdoms: 6
Phyla: 55
Classes: 362
Orders: 182
Families: 452
Genera: 2,264
Species: 11,852
Strains: 4,263
Metatranscriptome processing¶
Qiita currently has one active Metatranscriptome data analysis pipeline, as follows:
Ribosomal read filtering via SortMeRNA; details below. This produces a Ribosomal reads and a Non-ribosomal reads artifact/
Sequence profiling via Woltka; for more information see details above.
Sample processing guidelines for metatranscriptomic data¶
Ribosomal read filtering¶
SortMeRNA is used for removal of ribosomal reads from quality filtered Metatranscriptome data
Latest SortMeRNA version: v2.1
Input: Quality filtered Metatranscriptome reads (FASTA/FASTQ) Ribosomal reads are identified by searching against pre-curated rRNA databases. Currently, rRNA databases covering bacteria, archaea and eukarya were downloaded and indexed from SILVA and Rfam. Currently indexed databases and their clustering ids:
silva-bacterial-16s-id 90%
silva-bacterial-23s-id 98%
silva-archaeal-16s-id 95%
silva-archaeal-23s-id 98%
silva-eukarya-18s-id 95%
silva-eukarya-28s-id 98%
rfam-5s-database-id 98%
rfam-5.8s-database-id 98%
The above databases and ID cut-offs were chosen to work with a range of samples including more diverse/complex environmental samples.
Building Custom databases¶
Custom databases can also be built in addition to the above mentioned databases. Custom databases can be built by using the using the ARB package to extract FASTA files for:
16S bacteria, 16S archaea and 18S eukarya using SSURef_NR99_119_SILVA_14_07_14_opt.arb
23S bacteria, 23S archaea and 28S eukarya using LSURef_119_SILVA_15_07_14_opt.arb
The built databases will then have to be indexed before running SortMeRNA. Reference database(s) and their corresponding indexes separated by “,” and multiple databases are separated by “:”
SortMeRNA Usage¶
SortMeRNA filters the ribosomal from the non-ribosomal reads from the input sample dataset (via BLAST search)and outputs two fasta/q files containing the ribosomal and non-ribosomal reads respectively. Additionally, a summary file showing the proportion of reads matching to each of the screened ribosomal databases can also be made available. Default options have been set to report only the best alignment per read reaching E-value. For non ribo-depleted samples (i.e. total RNA), the ribosomal reads obtained from SortMeRNA can be further used in taxonomic/compositional analysis. In the case of ribo-depleted samples, only the non-ribosomal reads are used in downstream analyses such as assembly, mapping, differential gene abundance analyses etc.
Genome Isolate Processing¶
This workflow can be used for assembling (meta)-genomes (isolate and/or metagenomic data) using SPAdes v3.15.2 at set k-mer lengths of 21,33,55,77,99 and 127.
The assembled contigs are stored in per sample FASTA files (originally scaffolds.fna in SPAdes).
The –merge option merges the forward and reverse reads prior to assembly (preferable for isolate or metagenomes with high sequencing depth), the non-merge option works well for shallow shotgun data and/or complex environmental communities.
The –meta flag is used to assemble metagenomic datasets.