Experimental Design and Sample Preparation
============================================

**Key Takeaways**

**There are caveats and biases in every method**

- Understand the biases in your methods
- Understand how that bias will affect results

**Plan and optimize well, then stick to one protocol!**

Experimental Design
------------------------------------------------------------------

**How do I start a microbiome amplicon study? Plan ahead!**

It is particularly important to consider the details of your study before you start:

- Cannot add samples later
- Cannot remove contamination after samples are collected
- Cannot add the controls needed to validate quality after sequencing
- Cannot remove bias if other factors vary with the treatment of interest
- Cannot get more microbial DNA if host DNA dominates

.. figure:: ../images/experimental_design_flaw.jpg
   :scale: 60 %
   :alt: experimental_design_flaw

   Source: Aaron Bacall (www.art.com)

**Start with a question and hypothesis.**

- Consider the expected effect size on microbiome composition
- Calculate how many samples will be needed to detect the effect if it exists
- Consider both community-wide changes and changes in the abundance and presence of individual taxa
- Keep in mind that not every sample will pass every QC step, so samples will be lost

  - Larger N
  - Parallel processing replicates

**Always step back and ask: "Will a whole community profile answer my question?"**

- Often people hypothesize that "the microbiome will change" without considering what that actually means
- Amplicon sequencing doesn't tell you community function
- It won't give you strain-level or even consistent species-level resolution

  - In addition to *E. coli*, there are five other species in the genus *Escherichia*, and some are commensal in the human gut

Sampling and Controls
------------------------------------------------------------------

**How will you collect your samples?**

**In general, consider the following during collection:**

- **Size of sample**

  - Determines available DNA
  - Can you collect an equal amount from each subject? (*absolute quantification*)
  - Reserve sample - mix evenly so the reserve is the same as the utilized sample

- **Sampling vessel**

  - Ease of collection
  - Sterility
  - Stabilization solution
  - Need DNA *and* RNA
  - Other 'omics - *e.g. mass spectrometry is affected by salt concentration*

.. figure:: ../images/genotech_tube.jpg
   :scale: 40 %
   :alt: genotech_tube

   Source: www.dnagenotek.com

- **Storage**

  - Immediate storage at -80 °C
  - Long-term stabilization solutions can introduce bias
  - Prevent bacterial overgrowth at room temperature

**Key controls to keep in mind:**

- **Uniformity**

  - Collect samples at the same time and with the same method across treatment groups
  - Do the same for processing: DNA extraction and library preparation for sequencing

- **Randomization**

  - Consider the many factors that could vary across treatment groups

    - Time of day, season
    - Sampling in the same location
    - Mouse cage effects

      - Co-housing
      - Mixing bedding and food

- **Sterility**

  - Collect samples, but also sample microbes in the local environment if possible
  - Always include blank water samples

    - Collection blank
    - Extraction blank
    - Library preparation blank
    - Sequencing blank

  - Work in a biosafety hood whenever possible
  - Samples are particularly sensitive to contamination before PCR steps

- **Mock community**

  - Known taxonomy and abundance

- **Generous donor**

  - If sampling / extracting / sequencing in batches, include one consistent sample in each batch

**If working with human subjects:**

.. figure:: ../images/microbe_human_teddy.gif
   :scale: 100 %
   :alt: human_microbe_teddy

   Source: Charis Tsevis

- Consider ease of self-collection
- Different body sites have very different communities

  - Even location on a stool sample matters (inside versus outside)
  - Biopsies are loaded with human DNA

- IRB-compliant methods
- Gloves
- A way to catch stool in the toilet and dispose of it
- Proper instructions
- Immediate acquisition versus returning samples later
- Cannot control the amount provided with self-collection
- Must stabilize to prevent bacterial overgrowth, or 'blooms'

.. figure:: ../images/genotech_sampling.png
   :scale: 40 %
   :alt: genotech_sampling

   Source: www.dnagenotek.com

DNA Extraction
------------------------------------------------------------------

**DNA extraction techniques vary significantly in community bias**

- Every kit introduces bias, so pick one and stick with it!
- There are many options, so research which is best for your long-term goals

  - The `Earth Microbiome Project `_ has well documented protocols.

- Efficiency at lysing cells is important

  - Bead beating versus chemical lysis
  - Mechanical bead beating seems to be the most thorough for both Gram-positive and Gram-negative bacteria
  - Keep in mind that bead beating heats the sample through friction, so don't overdo it!

- Take time to properly optimize and document

  - Always note the extraction batch, who performed the extraction, and the kit lot number

- Always include extraction blanks; all kits have contaminants - `kit'ome `_.
- Some kits perform dual DNA and RNA extraction

Amplification
------------------------------------------------------------------

**PCR amplification of the marker gene of interest**

- **Use known protocols and established primers**

  - Different variable regions bias the community differently, so stick to what's known!
  - The `Earth Microbiome Project `_ has well documented PCR protocols (*16S, 18S, ITS*)
  - Always optimize your PCR for the expected sequence length and concentration
  - Include a PCR blank to control for processing contamination

- **Sequencing Depth**

  - Generally, 10k - 100k sequences per sample is adequate coverage
  - Diminishing returns at greater depths
  - Taxonomic resolution depends on sequence length, not sequencing depth
  - More reads also bring more sequencing errors, which are removed by QC filtering
  - Analyses use rarefied counts or relative abundance
  - Rare organisms will still be rare; depth doesn't change proportions

**Earth Microbiome Project V4 Primers**

- 515F FWD: GTGYCAGCMGCCGCGGTAA
- 806R REV: GGACTACNVGGGTWTCTAAT

*The primer sequences in EMP protocols are always listed in the 5′ -> 3′ orientation. This is the orientation that should be used for ordering.*

Components of the full reaction (FWD):

=================== ================================
Component           Sequence (5′ -> 3′)
=================== ================================
5′ Illumina adapter AATGATACGGCGACCACCGAGATCTACACGCT
Golay barcode       XXXXXXXXXXXX
Pad                 TATGGTAATT
Linker              GT
Forward primer      GTGYCAGCMGCCGCGGTAA
=================== ================================

- Each sample PCR reaction will be performed with its own unique barcode combination

  - This allows bioinformatic demultiplexing - assigning sequences to their respective sample

Sequencing Platforms
------------------------------------------------------------------

**Illumina**

.. figure:: ../images/illumina_theory.png
   :scale: 90 %
   :alt: illumina_theory

   Illumina sequencing molecular biology

   Source: Jaroslaw Grzadziel (ResearchGate)

- By far the most widely used sequencing method
- Millions of sequences allow multiplexing many samples on one run
- High coverage of the community (~10k sequences per sample is ideal, depending on complexity)
- Short sequences (100-250 bp) generally cover 1-3 variable regions

  - Allows high coverage, but loses resolution past genus level

- Paired end is better than single end

.. figure:: ../images/paired_end_sequencing.jpg
   :scale: 90 %
   :alt: paired_end_sequencing

   Mechanism of paired end Illumina sequencing

   Source: Christine King

**Illumina Amplicon Sequencing at VANTAGE**

- 250 bp sequence reads (covers the entire V4 region)
- Dual-index sequencing strategy - increased multiplexing
- Each MiSeq run produces >5,000,000 reads (great for multiplexing)
- Best option for 16S rRNA gene amplicon sequencing
- VANTAGE also offers HiSeq, NextSeq, and NovaSeq Illumina sequencing platforms

  - MiSeq has 250 bp read length; other platforms only offer shorter reads, so plan ahead

**PacBio and Nanopore**

- Could sequence the full 16S gene

  - Very high, strain-level resolution

- But *much* lower coverage and multiplexing = higher costs
- Higher error rates

Mapping File
------------------------------------------------------------------

**How do you store all of this information about samples in a useful way?**

- **A mapping file is composed of:**

  - Each row representing a sample
  - Each column representing some information about each sample
  - The file is stored as TSV (tab-separated values)

    - Excel can save TSV as a tab-delimited .txt file

  - A header line starting with a # indicates the column names

.. figure:: ../images/map_exp.png
   :scale: 30 %
   :alt: map_exp

   Source: Example mapping file in Excel.

- Always start with the **#sampleid** column

  - A unique id for each sample - cannot be the same as any other sample

- Include columns for the unique sample **barcodes**

  - Allows demultiplexing of sequences to their respective sample

- Also include **control and QC** information about DNA extraction batch, person extracting, PCR batch, sequencing run, cage...

  - Allows testing whether these factors influenced the community composition
  - Have columns indicating which samples are blanks, mock community, and generous donor samples

- Have columns for your **treatment** and any **covariates** of interest

  - Age, sex, BMI, collection season...

- Last column should be **description** if you need backward compatibility with QIIME1

For more detailed information, check out the full `QIIME2 Metadata Guide! `_

**Best Practices**

- Rule Number 1: Don't get fancy!
- Don't use spaces; use _ (underscore) instead

  - The exception is #sampleid: the only punctuation allowed in this column is dashes '-' (no underscores)

- Don't include other types of punctuation; this will only cause problems later on!
- Leading and trailing whitespace is ignored
- I'd stick to all lower-case (case-insensitive) characters

  - Not required, but may save you a *lot* of trouble with weird errors later on!
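
The mapping file rules above can be checked before any sequences are processed. The following is a minimal sketch in Python, assuming a tab-separated file whose first column is ``#sampleid`` and whose barcode column is named ``barcode``. The function name ``check_mapping_file``, the column name, and the barcodes in the example are hypothetical, chosen only for illustration; this is not QIIME's own metadata validator.

```python
import csv
import io
import re

# Rules taken from the checklist above (an assumption, not QIIME's validator):
# sample ids may contain only letters, digits, and dashes.
SAMPLE_ID_PATTERN = re.compile(r"^[A-Za-z0-9-]+$")


def check_mapping_file(handle, barcode_column="barcode"):
    """Return a list of human-readable problems found in a TSV mapping file."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        return ["file is empty"]

    problems = []
    header = [name.strip() for name in rows[0]]
    if header[0] != "#sampleid":
        problems.append("first column must be '#sampleid'")
    for name in header:
        if " " in name:
            problems.append(f"column name contains a space: {name!r}")

    seen_ids = set()
    seen_barcodes = set()
    for line_number, row in enumerate(rows[1:], start=2):
        # Leading and trailing whitespace is ignored, as noted above.
        values = [value.strip() for value in row]
        if len(values) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} fields, got {len(values)}"
            )
            continue
        record = dict(zip(header, values))

        sample_id = values[0]
        if not SAMPLE_ID_PATTERN.match(sample_id):
            problems.append(f"line {line_number}: invalid sample id {sample_id!r}")
        if sample_id in seen_ids:
            problems.append(f"line {line_number}: duplicate sample id {sample_id!r}")
        seen_ids.add(sample_id)

        if any(" " in value for value in values):
            problems.append(f"line {line_number}: value contains a space")

        # Barcodes must be unique so reads can be demultiplexed.
        barcode = record.get(barcode_column)
        if barcode:
            if barcode in seen_barcodes:
                problems.append(f"line {line_number}: duplicate barcode {barcode!r}")
            seen_barcodes.add(barcode)
    return problems


# Made-up example: one id with an underscore, one repeated id and barcode.
example = (
    "#sampleid\tbarcode\ttreatment\tdescription\n"
    "mouse-1\tAGCCTTCGTCGC\tcontrol\tcage_a\n"
    "mouse_2\tTCCATACCGGAA\tabx\tcage_b\n"
    "mouse-1\tAGCCTTCGTCGC\tabx\tcage_b\n"
)
problems = check_mapping_file(io.StringIO(example))
for problem in problems:
    print(problem)
```

For a real file, the same function could be called with ``open("mapping.txt", newline="")`` (a placeholder filename). Running a check like this before sequencing is cheap compared with discovering a duplicated barcode afterwards.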