Concepts of Community Analysis ============================================ **How can we examine microbial communities?** **Let's switch datasets and look at:** `The American Gut Project (AGP) `_ .. figure:: ../images/agp.jpg :scale: 140 % :alt: agp American Gut Project (AGP) logo. Source: http://humanfoodproject.com/americangut/ **Why?** - Clustering features, creating a phylogeny, and assigning taxonomy are the most computationally intensive - We used very small and sparse data - This subset of the AGP has 1,375 individuals - There are 5,590 features (bacteria) and ~200 metadata measurements - You may get processed data from a colaborator / website already as a table, phylogeny... - Need to know how to import at any stage of the project - `QIIME2 Importing Tutorial `_ - There are far more options than even they show - Check the docs: qiime tools import --help - There are many types of data: qiime tools import --show-importable-types - For each type there are many subformats: qiime tools import --show-importable-formats **Importing the AGP** 1. Look at the files in the american_gut folder 2. Import the biom table 3. Import the bacterial phylogeny 4. Look at the available metadata .. code-block:: bash # See the files we will work with ls -lsh american_gut # Import the biom count table of features by samples qiime tools import \ --input-path american_gut/1_0_biom.biom \ --type 'FeatureTable[Frequency]' \ --input-format BIOMV100Format \ --output-path american_gut/1_1_biom.qza # Import the bacterial 16S phylogeny qiime tools import \ --input-path american_gut/1_0_tree.tre \ --output-path american_gut/1_1_tree.qza \ --type 'Phylogeny[Rooted]' # Import the bacterial taxonomy qiime tools import \ --input-path american_gut/1_0_tax.tsv \ --output-path american_gut/1_1_tax.qza \ --type FeatureData[Taxonomy] \ --input-format HeaderlessTSVTaxonomyFormat # Create a visual of the metadata (map.txt) qiime metadata tabulate \ --m-input-file american_gut/1_0_map.txt \ --o-visualization american_gut/1_1_metadata.qzv # Look at metadata available qiime tools view american_gut/1_1_metadata.qzv # See the new files we generated ls -lsh american_gut .. figure:: ../images/agp_metadata.png :scale: 80 % :alt: agp_metadata AGP metadata is quite extensive, giving us a lot of options. **We should also look at the table to see what we're working with...** .. code-block:: bash # Summarize the table qiime feature-table summarize \ --i-table american_gut/1_1_biom.qza \ --o-visualization american_gut/1_1_biom_sumamry.qzv # View the table summary qiime tools view american_gut/1_1_biom_sumamry.qzv .. figure:: ../images/agp_table_raw.png :scale: 40 % :alt: agp_table_raw Over 29 million sequences passed QC and were assigned to samples. Quality Control ------------------------------------------------------------------ **What should we do to improve the resolution and remove noise?** **1. Filter features from the observation table by:** - Minimum or maximum frequency (counts across samples) - Minimum or maximum ubiquity (proportion of samples) - Where metadata is equal to a certain value - Removing specific samples by the unqiue id **Options for specifically filtering features:** - Lots of options: qiime feature-table filter-features --help --i-table ARTIFACT_PATH The feature table from which features should be filtered. [required] --p-min-frequency INTEGER The minimum total frequency that a feature must have to be retained. [default: 0] --p-max-frequency INTEGER The maximum total frequency that a feature can have to be retained. If no value is provided this will default to infinity (i.e., no maximum frequency filter will be applied). [optional] --p-min-samples INTEGER The minimum number of samples that a feature must be observed in to be retained. [default: 0] --p-max-samples INTEGER The maximum number of samples that a feature can be observed in to be retained. If no value is provided this will default to infinity (i.e., no maximum sample filter will be applied). [optional] --m-metadata-file MULTIPLE_PATH Metadata file or artifact viewable as metadata. This option may be supplied multiple times to merge metadata. Feature metadata used with `where` parameter when selecting features to retain, or with `exclude_ids` when selecting features to discard. [optional] --p-where TEXT SQLite WHERE clause specifying feature metadata criteria that must be met to be included in the filtered feature table. If not provided, all features in `metadata` that are also in the feature table will be retained. [optional] --p-exclude-ids Samples to remove --p-no-exclude-ids If true, the features selected by `metadata` or `where` parameters will be excluded from the filtered table instead of being retained. [default: False] --o-filtered-table ARTIFACT_PATH FeatureTable[Frequency] The resulting feature table filtered by feature. [required if not passing --output-dir] --output-dir DIRECTORY Output unspecified results to a directory **Let's remove really rare features, to cut down on sequencing artefacts** - This is filtering by total **abundance** **Let's also remove features not in at least 20% of samples** - The percentage or proportion of individuals a feature appears in is known as the **ubiquity** .. code-block:: bash # Filter features with less than 50 total counts across all samples qiime feature-table filter-features \ --i-table american_gut/1_1_biom.qza \ --o-filtered-table american_gut/1_2_biom_min_50.qza \ --p-min-frequency 50 # Filter features in less than 20 percent of samples # Note: I know N = 1,375 / 5 = 275 qiime feature-table filter-features \ --i-table american_gut/1_2_biom_min_50.qza \ --o-filtered-table american_gut/1_2_biom_min_50_ubiq_20.qza \ --p-min-samples 275 # Summarize the table qiime feature-table summarize \ --i-table american_gut/1_2_biom_min_50_ubiq_20.qza \ --o-visualization american_gut/1_2_biom_summary.qzv # View the table summary qiime tools view american_gut/1_2_biom_summary.qzv **Let's see how we did:** .. figure:: ../images/agp_filtered.png :scale: 70 % :alt: agp_filtered Looks like we cut down our features from 5,590 to 664, not bad... .. figure:: ../images/agp_feature_counts.png :scale: 80 % :alt: agp_feature_counts A few features dominate the counts... **Keep in mind that microbiome counts are not normally distributed** - More often follow a Poisson / Negative Binomial distribution - This roughly means there are often many low / rare values and some very high outliers .. figure:: ../images/poisson.png :scale: 150 % :alt: poisson Different Poisson distributions can be modeled, microbiome counts will look like the yellow most often. Source: Wikipedia We also want to remove samples with low counts, and there is a nice tool to choose that cutoff... .. figure:: ../images/agp_sample_cutoff.png :scale: 120 % :alt: agp_sample_cutoff **Keep in mind you could also:** - Remove particular microbial taxon: --p-exclude Wolbachia - Remove specific samples: --p-exclude-ids samples-to-remove.tsv - Remove samples based on metadata: --p-where "sex IN ('male', 'female')" **2. We can remove samples with the filter-samples tool** Let's remove samples with less than 50,000 counts - This is a high cutoff, and will cut us down to only 53 samples - This will make the tutorial easier, but 1,000 or 10,000 is often more than adequate .. code-block:: bash qiime feature-table filter-samples \ --i-table american_gut/1_2_biom_min_50_ubiq_20.qza \ --o-filtered-table american_gut/1_3_biom_samplemin_50k.qza \ --p-min-frequency 50000 qiime feature-table summarize \ --i-table american_gut/1_3_biom_samplemin_50k.qza \ --o-visualization american_gut/1_3_biom_summary.qzv qiime tools view american_gut/1_3_biom_summary.qzv This is nice for our purposes today, but note that some samples still have much higher counts than others... .. figure:: ../images/agp_sam_filter.png :scale: 100 % :alt: agp_sam_filter **Let's work with this cleaner data, while keeping in mind there are tons of other ways to filter!** - Remove sequences before Deblur feature analysis: qiime quality-control exclude-seqs - Remove samples and features based on the mock community: qiime quality-control evaluate-composition - `More extensive filtering documentation `_. **Keep in mind:** - At this point in a real analysis I would back up the filtered files Community Transformation and Examination ---------------------------------------------------------------------------- **Let's look at the data beyond just a table summary!** **The feature-table tool has a lot of options:** .. code-block:: bash qiime feature-table --help Commands: -core-features Identify core features in table -filter-features Filter features from table -filter-samples Filter samples from table -filter-seqs Filter features from sequences -group Group samples or features by a metadata column -heatmap Generate a heatmap representation of a feature table -merge Combine multiple tables -merge-seqs Combine collections of feature sequences -merge-taxa Combine collections of feature taxonomies -presence-absence Convert to presence/absence -rarefy Rarefy table -relative-frequency Convert to relative frequencies -subsample Subsample table -summarize Summarize table -tabulate-seqs View sequence associated with each feature **Options include:** - filtering - grouping / collapsing - transforming - rarefy, relative abundance, presence / absence - creating summaries and figures **I'd suggest creating shortcut variables for the data we imported:** - It's easier than typing the full path everytime, and easier to remember - Their names can be arbitrary, I like to go with something logical though - For best practices they should be called with quotes and a number sign - "$XXXXX" - Without the number sign it will be treated as a word and not a variable holding something else - The quotes are optional today, but in more complex work they will save you a lot of weird errors - If you wanted to analyze different data you only have to switch paths in one place, not in every function call .. code-block:: bash # The Feature table can now be called as: "$table" table=american_gut/1_3_biom_samplemin_50k.qza echo "$table" echo $table echo table # The taxononomy table can now be called as: "$tax" tax=american_gut/1_1_tax.qza echo "$tax" # The phylogeny can now be called as: "$tre" tre=american_gut/1_1_tree.qza echo "$tre" # The metadata can now be called as: "$map" map=american_gut/1_0_map.txt echo "$map" **So let's exmaine how our community is taxonomically distributed!** - Barplots of the taxonomic distribution of features for each sample - These plots are based on relative abundance within a total sample - Do not provide information about absolute microbial load - Note below how I use the shortcuts instead of writing out full paths... - It just looks so much cleaner too! .. code-block:: bash # Make and view a taxonomy barplot qiime taxa barplot \ --i-table "$table" \ --i-taxonomy "$tax" \ --m-metadata-file "$map" \ --o-visualization american_gut/2_0_taxonomy_barplot.qzv qiime tools view american_gut/2_0_taxonomy_barplot.qzv .. figure:: ../images/agp_taxa_barplot.png :scale: 70 % :alt: agp_taxa_barplot **This is a really useful tool for examining our microbial communities** - Just remember how the relative abundance works... - Everything is relative to everything else! - One abundant microbe *decreasing* makes rarer microbes look *more* abundant - One abundant microbe *increasing* makes rarer microbes look *less* abundant Rarefaction and Normalization ------------------------------------------------------------------ **Raw counts are biased:** - Samples have different numbers of sequence counts - Not indicative of absolute bacterial abundance - Artifact of extraction, library preparation and pooling, and sequencing efficiency **Some ways to transform the counts:** -presence-absence Convert to presence/absence -rarefy Rarefy table -relative-frequency Convert to relative frequencies .. code-block:: bash qiime feature-table presence-absence \ ... qiime feature-table rarefy \ ... qiime feature-table relative-frequency \ ... **Presence / Absence**: value of 1 if taxon is present in sample, 0 otherwise **Rarefy**: subsample the same number of sequence from each sample **Relative Frequency**: propotion of each feature out of the total feature counts for each sample **There are other normalization methods, often based on RNA sequencing methods.** - Generally seem more biased than rarefy / relative abundance - Better fit in standard sense, don't seem to adapt well to wide bias across different datasets .. figure:: ../images/norm_rare_rel.gif :scale: 150 % :alt: norm_rare_rel Comparison of different methods to control for different sequence counts. Purple and Orange are what we care about! Source: Weiss, S. *et. al.* Microbiome. 2017 -None No correction for unequal library sizes is applied -Proportion Counts in each column are scaled by the column’s sum -Rarefy Each column is subsampled to even depth without replacement (hypergeometric model) -logUQ Log upper quartile—Each sample is scaled by the 75th percentile of its count distribution; then, the counts are log transformed -CSS Cumulative sum scaling—This method is similar to logUQ, except that CSS enables a flexible sample distribution-dependent threshold for determining each sample’s quantile divisor. Only the segment of each sample’s count distribution that is relatively invariant across samples is scaled by CSS. This attempts to mitigate the influence of larger count values in the same matrix column -DESeqVS Variance stabilization (VS)—For each column, a scaling factor for each OTU is calculated as that OTU’s value divided by its geometric mean across all samples. All of the reads for each column are then divided by the median of the scaling factors for that column. The median is chosen to prevent OTUs with large count values from having undue influence on the values of other OTUs. Then, using the scaled counts for all the OTUs and assuming a Negative Binomial (NB) distribution, a mean-variance relation is fit. This adjusts the matrix counts using a log-like transformation in the NB generalized linear model (GLM) such that the variance in an OTU’s counts across samples is approximately independent of its mean -edgeR-TMM Trimmed Mean by M-Values (TMM)—The TMM scaling factor is calculated as the weighted mean of log-ratios between each pair of samples, after excluding the highest count OTUs and OTUs with the largest log-fold change. This minimizes the log-fold change between samples for most OTUs. The TMM scaling factors are usually around 1, since TMM normalization, like DESeqVS, assumes that the majority of OTUs are not differentially abundant. The normalization factors for each sample are the product of the TMM scaling factor and the original library size **Summary: I'd suggest sticking to relative abundance and rarefied communities!** - Remember how many things can bias microbiome data - The Poisson / Negative Binomial fit in more advanced models doesn't seem to adapt well Alpha and Beta Diversity ------------------------------------------------------------------ **We can use ecological metrics to understand community composition and inter-sample microbiome similarity!** **Alpha Diversity**: Measure of diversity within a sample .. figure:: ../images/alpha_div_theory.jpg :scale: 80 % :alt: alpha_div_theory Alpha diversity metrics can look at richness, evenness, or both within a sample. Source: Finotello. Briefings Bioinformatics. 2016. **Beta Diversity**: Measure of microbiome similarity between a pair of samples .. figure:: ../images/beta_div_theory.png :scale: 80 % :alt: beta_div_theory Beta diversity calculates how similar two total ecosystems are. Source: Kate M. Socratic.org. `There are many metrics (ways to measure) Alpha and Beta Diversity `_ **Both Alpha and Beta Diversity can take into account the microbial phylogeny with certain metrics** Metrics are assigned with the *-p-metric* argument: .. code-block:: bash # Alpha Diversity - Non-Phylogenetic Observed OTUs qiime diversity alpha \ --i-table "$table" \ --p-metric observed_otus \ --o-alpha-diversity american_gut/2_2_observed_otus_vector.qza # Alpha Diversity - Non-Phylogenetic Observed OTUs qiime diversity alpha-phylogenetic \ --i-table "$table" \ --i-phylogeny "$tre" \ --p-metric faith_pd \ --o-alpha-diversity american_gut/2_2_faith_pd_vector.qza # Beta Diversity - Non-Phylogenetic Bray-Curtis qiime diversity beta \ --i-table "$table" \ --p-metric braycurtis \ --o-distance-matrix american_gut/2_2_unweighted_unifrac_distance_matrix.qza # Beta Diversity - Phylogenetic Unweighted UNIFRAC qiime diversity beta-phylogenetic \ --i-table "$table"\ --i-phylogeny "$tre" \ --p-metric unweighted_unifrac \ --o-distance-matrix american_gut/2_2_unweighted_unifrac_distance_matrix.qza **QIIME2 has done a good job wrapping diversity computation with my favorite metrics into one easy script:** - Rarefies the table to a specified depth (necessary for diversity to be computed accurately) - Calculates Alpha and Beta Diversity - **Alpha Metrics**: - **Faith's PD**: Phylogenetic diversity - **Observed OTUs**: Richness of community - **Shannon**: Balances richness and evenness - **Pielou's Evenness**: Evenness of community - **Beta Diversity** - **Unweighted Unifrac**: Presence / absence phylogenetic distance between samples - **Weighted Unifrac**: Abundance weighted phylogenetic distance between samples - **Jaccard**: Presence / absence distance between samples - **Bray Curtis**: Abundance weighted distance between samples - Can be parallelized across multiple CPUs - Generates output visuals Options: --i-table ARTIFACT_PATH The feature table containing the samples over which diversity metrics should be computed. [required] --i-phylogeny ARTIFACT_PATH Phylogenetic tree containing tip identifiers that correspond to the feature identifiers in the table. This tree can contain tip ids that are not present in the table, but all feature ids in the table must be present in this tree. [required] --p-sampling-depth INTEGER_RANGE The total frequency that each sample should be rarefied to prior to computing diversity metrics. [required] --m-metadata-file MULTIPLE_PATH Metadata file or artifact viewable as metadata. This option may be supplied multiple times to merge metadata. The sample metadata to use in the emperor plots. [required] --p-n-jobs INTEGER_RANGE The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. (Description from sklearn.metrics.pairwise_distances) [default: 1] **Let's get it running, then I can explain it...** .. code-block:: bash qiime diversity core-metrics-phylogenetic \ --i-phylogeny "$tre" \ --i-table "$table" \ --p-sampling-depth 50000 \ --m-metadata-file "$map" \ --output-dir american_gut/2_3_diversity_core_phylogenetic **Note the --p-sampling-depth Option:** - This is the depth we will rarefy for each sample - Since we know each sample has 50,000 counts we will use that - Essentially we are just randomly picking 50,000 counts from each sample - While effect size matters, rarefy depth doesn't hugely affect ability to detect a signal - More samples is stronger than deeper rarefy depth! Don't throw away info .. figure:: ../images/rare_1000_10.png :scale: 60 % :alt: rare_1000_10 The left figure is rarefied to 1,500 seqs/sample, the right is rarefied to 10 seqs/samples. Source: Kuczynski, J. Genome Biology. 2010 .. code-block:: bash # List files in diversity results ls -lsh american_gut/2_3_diversity_core_phylogenetic/ What do you notice about the types of files that were made? If for some reason you don't have a phylogeny, there is a non-phylogenetic wrapper method - One could also aruge about the degree to which functional similarity parallels genetic similarity of the 16S rRNA gene **Let's look at the alpha diversity** - Here I get a bit more fancy, wrapping each of the metrics in a for loop - Don't have to write the 'qiime diversity alpha-group-significance...' command four times .. code-block:: bash # Create a shortcut to our diversity output folder diversity_folder=american_gut/2_3_diversity_core_phylogenetic/ ##### ALPHA DIVERSITY ANALYSES ##### # for each alpha diversity metric... for alpha_metric in faith_pd shannon observed_otus evenness do # print the diversity vector name echo "$diversity_folder""$alpha_metric"_vector.qza ### ALPHA ASSOCIATION CATEGORICAL MAPPING ### qiime diversity alpha-group-significance \ --i-alpha-diversity "$diversity_folder""$alpha_metric"_vector.qza \ --m-metadata-file "$map" \ --o-visualization american_gut/2_4_"$alpha_metric"_group_significance.qzv # Close the for loop done Using a for loop and shortcut to the diversity folder are really nice because - We could easily add or remove alpha metrics - We could apply this to a different diversity folder by changing one line - **Think about making you code flexible when building pipelines!!!** .. code-block:: bash # Variable you can change to easily see other metrics alpha_metric=shannon # View Alpha diversity results qiime tools view american_gut/2_4_"$alpha_metric"_group_significance.qzv **Take note of the statistics included** - Which groups have - Keep in mind corrections for multiple variables .. figure:: ../images/kw_test.png :scale: 110 % :alt: kw_test Kruskal Wallis is non-parametric because it uses the rank order and not the actual values. Source: spss-tutorials.com **Let's look at the beta diversity** - Here is another useful trick for running different metrics, commenting them out - Note that if two are left active the beta_metric will be set to the last call .. code-block:: bash # Variable you can comment on / off to easily see other metrics, only use one at a time beta_metric=weighted_unifrac #beta_metric=unweighted_unifrac beta_metric=bray_curtis #beta_metric=jaccard # View Alpha diversity results qiime tools view "$diversity_folder""$beta_metric"_emperor.qzv .. figure:: ../images/pcoa.png :scale: 30 % :alt: pcoa Principle Coordinates of Analysis (PCoA) **We will talk about this in more detail, and specific ways to test hypotheses. Click 'Next' down to the right**