Processing recommendations

Currently, Qiita supports the processing (*) of raw data from:

  1. Target gene barcoded sequencing
  2. Shotgun sequencing

Note that the processing options offered are chosen mainly so that we can perform meta-analyses, that is, combine different studies, even ones that used different wet lab techniques or sequencing technologies. Remember to check the Example study processing workflow before continuing.

For more information about meta-analysis, examples and things to consider:

(*) Remember that you can also upload BIOM tables for plotting; that is not covered here because this section deals only with raw data.

Target gene barcoded sequencing

For this you can start with raw, non-demultiplexed data or with per_sample_FASTQ files; see the Example study processing workflow. Either way, you will need to run “Split libraries and QC”, which uses the QIIME 1.9.1 defaults. Once your demultiplexed and QCed artifact is created, you need to select which processing to perform. There are two main methodologies for processing target gene data: sequence clustering and sequence cleanup.

Sequencing cleanup (preferred)

For this we use deblur. By default, two BIOM tables are generated: final.biom and final.only-16s.biom. The former is the full BIOM table, which can be used with any target gene and wet lab work; the latter is restricted to those sequences that match Greengenes at 80% similarity, a very basic and naive filter. Each of these BIOM tables is accompanied by a FASTA file that contains the representative sequences. The OTU IDs are the unique sequences themselves.

Note that deblur needs all sequences to be trimmed to the same length; thus the recommended pipeline is to trim everything to 150 bp and then run deblur.
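To see why a common length matters when the sequence itself is the ID, here is a minimal Python sketch. It is not deblur and does no error correction; it only trims reads to a fixed length and counts the identical sequences that result. The function names are hypothetical, and dropping reads shorter than the trim length is an assumption made for this illustration.

```python
from collections import Counter


def trim_reads(reads, length=150):
    """Cut every read to exactly `length` bases.

    Assumption for this sketch: reads shorter than `length` are dropped,
    so that every remaining sequence has the same length.
    """
    return [read[:length] for read in reads if len(read) >= length]


def count_unique(reads):
    """Count identical sequences; the sequence itself acts as the ID."""
    return Counter(reads)


# Toy reads of 160, 162 and 120 bases (not real data).
reads = ["ACGT" * 40, "ACGT" * 40 + "TT", "ACGT" * 30]

trimmed = trim_reads(reads, length=150)
table = count_unique(trimmed)

for sequence, count in table.items():
    print(len(sequence), count)
```

Before trimming, the first two reads differ only in their tails and would be given different IDs; after trimming to 150 bp they collapse into one sequence, and the 120-base read is left out.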

Sequencing clustering

Here we use closed-reference picking; for an explanation of the different picking methods see “Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences”. This generates a single BIOM table of OTUs per sample. The OTU IDs are taken from the selected reference database.
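As a rough illustration of the closed-reference idea, the Python sketch below assigns each read to its best-matching reference sequence and discards reads that match nothing well enough. It is a toy under stated assumptions, not the method Qiita runs: real pickers use alignment-based search, whereas this compares equal-length sequences position by position, and the function names, reference IDs and the 0.97 threshold are all hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences.

    Assumption for this sketch: sequences of different length (or empty
    ones) are treated as not matching at all.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_pick(reads, reference, threshold=0.97):
    """Count reads per reference ID; reads below `threshold` are discarded."""
    counts = {}
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            counts[best_id] = counts.get(best_id, 0) + 1
    return counts


# Hypothetical reference database and reads (not real data).
reference = {"ref_A": "A" * 100, "ref_B": "C" * 100}
reads = ["A" * 100, "A" * 98 + "GG", "G" * 100, "C" * 99 + "T"]

otu_counts = closed_reference_pick(reads, reference)
print(otu_counts)
```

The resulting IDs come from the reference rather than from the reads, and the read that matches no reference is simply dropped.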

Currently, we have the following reference databases: Greengenes version 13_8-97, Silva 119 and Unite 7. Whether a phylogenetic tree is available depends on the reference you select.

Shotgun sequencing

Qiita currently has one shotgun metagenomics data analysis pipeline. Note that this is the initial processing pipeline and we will be adding more soon.

With that said, the current workflow is as follows:

  1. Removal of adapter sequence and host contamination using KneadData.
  2. Gene calling and pathway profiling using HUMAnN2.

This workflow starts with per_sample_FASTQ files. We recommend only uploading sequences that have already been through QC and host / human sequence removal. However, all sequence files are currently required to go through KneadData to ensure they are ready for subsequent analyses. Currently, the KneadData command removes adapter sequences (choice of TruSeq3-PE-2 and NexteraPE-PE) and sequences mapping to the human genome (additional host genomes will become available soon).

Next, the QC’d sequences will be compared against reference databases to determine the presence and abundance of protein-coding functional genes and pathways using HUMAnN2. These are then summarized as BIOM tables, which can be used in subsequent analysis and visualization.

For more information visit the Shotgun Qiita Plugin GitHub page <https://github.com/qiita-spots/qp-shotgun>.