Processing recommendations

Currently, Qiita supports the processing of raw data from:

  1. Target gene barcoded sequencing

  2. Shotgun sequencing

  3. Metatranscriptome sequencing

  4. Genome isolate sequencing

Note that the selected processing recommendations are mainly guided towards performing meta-analyses, this is combine different studies, even from different wet lab techniques or sequencing technologies. However, these parameters shouldn’t prevent you using the resulting tables as your primary analytical source.

Target gene barcoded sequencing

For this you can start with raw, pre-demultiplexing data or per-sample FASTQ. Either way, you will need to “Split libraries and QC”, which uses the default in QIIME 1.9.1. Once your demultiplexed and QCed artifact is created you need to select which processing to perform. There are two main ideologies/methodologies to process target gene data: sequence clustering and sequence deblur.

Sequencing deblur (preferred)

For this we use deblur. Here 2 BIOM tables are generated by default: deblur final table and deblur reference hit table. The former is the full biom table, which can be used with any target gene and wetlab work; the latter is the trimmed version to those sequences that match Greengenes at 80% similarity, a really basic and naive filtering. Each of those BIOM tables, is accompanied by a FASTA that contains the representative sequences. The OTU IDs are given by the unique sequence.

Note that deblur needs all sequences to be trimmed at the same length, thus the recommended pipeline is to trim everything at 150bp and the deblur.

Sequencing clustering

Here we use close reference picking, for an explanation of the different picking methods see “Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences”. Here we generate a single BIOM table with the OTUs/per-sample. The OTU IDs are given based on the reference database selected.

Currently, we have the reference databases: Greengenes version 3_8-97, Silva 119 and Unite 7. Depending on your selection is if the reference has a phylogenetic tree.

The Sequencing Processing Pipeline (SPP)

The Knight Lab Sequencing Processing Pipeline is provided by the qp-klp plugin and it takes the internal BCL files, converts them to per-sample-FASTQ and does the QC and human-host sequencing filtering before the files are linked into the preparation. In other words, this is the Knight Lab pre-processing and automatic loading Metagenomic pipeline. Note that this is only available to Admins and only to process internally sequenced data.

Historically the Knight Lab have used these filtering steps:

  1. atropos & bowtie2 against GRCh38.p7 + Phi X 174

  2. fastp & minimap2 against GRCh38.p7 + Phi X 174

  3. fastp & minimap2 against GRCh38.p7 + Phi X 174 via SPP

  4. fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0

  5. fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0 via SPP

  6. fastp & minimap2 run in paired end and single end mode against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023

  7. fastp & minimap2 run in paired end and single end mode against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023 via SPP

  8. fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0, then Movi against GRCh38.p14, T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023

  9. fastp & minimap2 against GRCh38.p14 + Phi X 174 + T2T-CHM13v2.0, then Movi against GRCh38.p14, T2T-CHM13v2.0 + Human Pangenome Reference Consortium release 2023 via SPP

As part of the 2024.07 release, for easier tracking, we have added this information to the database and linked to the artifacts and preparations. The goal for the near future is to use the latest filtering on all the available data.

References:

Shotgun sequencing

Qiita currently has one active shotgun metagenomics data analysis pipeline: a per sample, paired-end bowtie2 alignment step with Woltka classification using either the WoLr2 (default) or RS210 databases. Below you will find more information about each of these options.

Note

The bowtie2 settings are set for interleaved processing with a maximum and minimum mismatch penalties (mp=[1,1]), a penalty for ambiguities (np=1; default), read and reference gap open and extend penalties (rdg=[0,1], rfg=[0,1]), a minimum alignment score for an alignment to be considered valid (score-min=[L,0,-0.05]), a defined number of distinct, valid alignments (k=16), and the suppression of SAM records for unaligned reads, as well as SAM headers (no-unal, no-hd), and using end-to-end alignments before using the multiseed heuristic (no-exact-upfront, no-1mm-upfront). More information visit:

The current workflow is as follows:

  1. A single step per sample adapter removal (via fastp) and host filtering (via minimap2); more information below.

  2. Taxonomy profiling using bowtie2 as an aligner and two different reference databases; see sections below

Note that we recommend only uploading sequences that have already been through QC and human sequence removal. However, we recommend that all sequence files go through adapter and host filtering within the system to ensure they are ready for subsequent meta-analyses. We currently provide the several options for your convenience. For each the fastp command is set to autodetect and remove universal adapter sequences (i.e., ‘GATCGGAAGAGCACACGTCTGAACTCCAGTCAC’ for R1 reads and ‘GATCGGAAGAGCGTCGTGTAGGGAAAGGAGTGT’ for R2 reads). We also provide the following host reference genomes for filtering against; each also filters against three phi x sequences (i.e., HM753704.1, JF719728.1, J02482.1):

For more information about the versions in this plugin, visit:

Note that the command produces up to 5 output artifacts based on the aligner and database selected:

  • Per genome Predictions: contains the raw alignment file and the per genome level predictions BIOM table

  • Per gene Predictions: Only WoLr2, contains the per gene level predictions BIOM table

  • KEGG Pathways: Only WoLr2, contains the functional profile

  • KEGG Ontology (KO): Only WoLr2, contains the functional profile

  • KEGG Enzyme (EZ): Only WoLr2, contains the functional profile

Note

Woltka 0.1.6 only produces per-genome, per-gene and functional profiles as we are moving to Operational Genomic Units (OGUs), which have higher resolution than taxonomic units for community ecology, and were shown to deliver stronger biological signals in downstream analyses. For more information please read: Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy. To work on lower taxonomic levels (like species or genus) you can follow these instructions and use this lineages.txt file with your collapse command.

Aligners

Note that some of these are legacy option but not available for new processing.

  1. Bowtie2: The classical ultrafast short sequence aligner. Based on-FM indexing of genome sequences to achieve efficient memory and CPU performance. We tuned the parameter setting for Bowtie2 to achieve optimal alignment accuracy for typical shotgun metagenome datasets.

  2. BURST: A novel sequence aligner featuring exhaustive (in contrast to heuristic) alignment against the entire reference genome database, hence achieving the highest accuracy, though with relatively long running time.

    • Version: 0.99.8

    • Alignment file format: b6o (BLAST tabular output, i.e., -outfmt 6)

    • Website: https://github.com/knights-lab/BURST

    • Citation: Gabriel Al-Ghalith and Dan Knights. BURST enables optimal exhaustive DNA alignment for big data. DOI 2017:doi.org/10.5281/zenodo.806850

    • Note: Manuscript under review.

  3. UTree A sequence classifier based on _k_-mer signature matching along a tree structure. Analogous to Kraken but with higher computational efficiency. The fastest option.

    • Version: 2.0 RF

    • Alignment file format: custom mapping file

    • Website: https://github.com/knights-lab/UTree

    • Citation: Gabriel Al-Ghalith and Dan Knights. Faster and lower-memory metagenomic profiling with UTree. DOI: 10.5281/zenodo.998252

Reference databases

Note that some of these are legacy option but not available for new processing.

  1. WoLr2 (“Web of Life” release 2): A significant upgrade from WoLr1. The genome pool is an even representation of microbial diversity, sampled from non-redundant bacterial and archaeal genomes from NCBI (RefSeq and GenBank, complete and draft). A high-quality reference phylogeny was reconstructed using the uDance workflow (manuscript in submission). Taxonomic classifications were curated according to phylogeny based on GTDB (default) and NCBI. Functional annotations were performed using EggNOG, GO, KEGG, MetaCyc, Pfam and UniRef.

    • Domains: Bacteria, Archaea

    • Number of genomes: 15,953

    • Total length (bp): 48,809,171,826

    • Citation: Zhu Q, Mai U, Pfeiffer W, et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat Commun. 2019. 10(1):5477. doi: 10.1038/s41467-019-13443-4.

    • Numbers of taxonomic units:

      • Domains: 2

      • Phyla: 124

      • Classes: 321

      • Orders: 914

      • Families: 2,057

      • Genera: 6,811

      • Species: 12,258

  2. RS210: Collection of reference microbial genomes sampled from the NCBI RefSeq genome database, as of 2022-01-01. This time point corresponds to RefSeq release 210.

    • Genomes: 29,648

    • Nucleotides: 926,894

    • Basepairs: 111,767,286,504 (includes linkers)

    • Numbers of taxonomic units:

      • Archaea: 606

      • Bacteria: 21,047

      • Fungi: 409

      • Protozoa: 93

      • Viral: 7,493

  3. WoLr1 (“Web of Life” release 1): An even representation of microbial diversity, selected using an prototype selection algorithm based on the MinHash distance matrix among all non-redundant bacterial and archaeal genomes from NCBI (RefSeq and GenBank, complete and draft), plus several genome quality control criteria. A high-quality reference phylogeny is available for this genome pool, enabling subsequent phylogeny-based analyses. Also available are curated taxonomic annotations, based on NCBI and GTDB systems.

    • Domains: Bacteria, Archaea

    • Number of genomes: 10,575

    • Total length (bp): 32,861,886,373

    • Citation: Zhu Q, Mai U, Pfeiffer W, et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat Commun. 2019. 10(1):5477. doi: 10.1038/s41467-019-13443-4.

    • Numbers of taxonomic units:

      • Kingdoms: 2

      • Phyla: 146

      • Classes: 89

      • Orders: 196

      • Families: 422

      • Genera: 2,081

      • Species: 9,105

      • Strains: 89

      • Note: Nucleotide sequences per genome were concatenated with a linker of 20 “N”s.

  4. Rep200: NCBI representative and reference microbial genomes, corresponding to RefSeq release 200 (2020-05-14)

    • Genomes: 11,955

    • Nucleotides: 926,894

    • Basepairs: 62,823,581,921 (excluding gaps)

    • Numbers of taxonomic units:

      • Archaea: 419

      • Bacteria: 11080

      • Fungi: 320

      • Protozoa: 88

      • Viral: 48

  5. Rep94: NCBI representative and reference microbial genomes, corresponding to RefSeq release 94.

    • Domains: Bacteria, Archaea

    • Number of genomes: 5,808

    • Total length (bp): 23,165,526,011

    • Note: Nucleotide sequences per genome were concatenated with a linker of 20 “N”s.

    • Numbers of taxonomic units:

      • Kingdoms: 2

      • Phyla: 38

      • Classes: 85

      • Orders: 186

      • Families: 427

      • Genera: 1,931

      • Species: 5,636

      • Strains: 84

  6. Rep82: NCBI representative and reference microbial genomes, corresponding to RefSeq release 82.

    • Not available anymore for new processing

    • Domains: Bacteria, Archaea, Viruses/Viroids

    • Number of genomes: 10,519

    • Total length (bp): 20,387,349,319

    • Note: Plasmids were isolated from bacterial and archaeal host genomes and considered as separate genomes.

    • Numbers of taxonomic units:

      • Kingdoms: 6

      • Phyla: 55

      • Classes: 362

      • Orders: 182

      • Families: 452

      • Genera: 2,264

      • Species: 11,852

      • Strains: 4,263

Metatranscriptome processing

Qiita currently has one active Metatranscriptome data analysis pipeline, as follows:

  1. Ribosomal read filtering via SortMeRNA; details below. This produces a Ribosomal reads and a Non-ribosomal reads artifact/

  2. Sequence profiling via Woltka; for more information see details above.

Sample processing guidelines for metatranscriptomic data

Ribosomal read filtering

SortMeRNA is used for removal of ribosomal reads from quality filtered Metatranscriptome data

Latest SortMeRNA version: v2.1

Input: Quality filtered Metatranscriptome reads (FASTA/FASTQ) Ribosomal reads are identified by searching against pre-curated rRNA databases. Currently, rRNA databases covering bacteria, archaea and eukarya were downloaded and indexed from SILVA and Rfam. Currently indexed databases and their clustering ids:

  • silva-bacterial-16s-id 90%

  • silva-bacterial-23s-id 98%

  • silva-archaeal-16s-id 95%

  • silva-archaeal-23s-id 98%

  • silva-eukarya-18s-id 95%

  • silva-eukarya-28s-id 98%

  • rfam-5s-database-id 98%

  • rfam-5.8s-database-id 98%

The above databases and ID cut-offs were chosen to work with a range of samples including more diverse/complex environmental samples.

Building Custom databases

Custom databases can also be built in addition to the above mentioned databases. Custom databases can be built by using the using the ARB package to extract FASTA files for:

  • 16S bacteria, 16S archaea and 18S eukarya using SSURef_NR99_119_SILVA_14_07_14_opt.arb

  • 23S bacteria, 23S archaea and 28S eukarya using LSURef_119_SILVA_15_07_14_opt.arb

The built databases will then have to be indexed before running SortMeRNA. Reference database(s) and their corresponding indexes separated by “,” and multiple databases are separated by “:”

SortMeRNA Usage

SortMeRNA filters the ribosomal from the non-ribosomal reads from the input sample dataset (via BLAST search)and outputs two fasta/q files containing the ribosomal and non-ribosomal reads respectively. Additionally, a summary file showing the proportion of reads matching to each of the screened ribosomal databases can also be made available. Default options have been set to report only the best alignment per read reaching E-value. For non ribo-depleted samples (i.e. total RNA), the ribosomal reads obtained from SortMeRNA can be further used in taxonomic/compositional analysis. In the case of ribo-depleted samples, only the non-ribosomal reads are used in downstream analyses such as assembly, mapping, differential gene abundance analyses etc.

Genome Isolate Processing

This workflow can be used for assembling (meta)-genomes (isolate and/or metagenomic data) using SPAdes v3.15.2 at set k-mer lengths of 21,33,55,77,99 and 127.

The assembled contigs are stored in per sample FASTA files (originally scaffolds.fna in SPAdes).

The –merge option merges the forward and reverse reads prior to assembly (preferable for isolate or metagenomes with high sequencing depth), the non-merge option works well for shallow shotgun data and/or complex environmental communities.

The –meta flag is used to assemble metagenomic datasets.