This document illustrates how to use genome-sampler on a small tutorial data set using Snakemake.
The Snakemake workflow makes genome-sampler very straightforward to run, but more challenging to customize to your needs than running it step by step, as illustrated in the Step-by-step tutorial.
Download tutorial data
The tutorial data set used here is intended for educational purposes only. If you’re interested in using these sequences for other analyses, we recommend starting with sequence repositories such as GISAID or NCBI GenBank, which have much more recent versions of these data.
Download the tutorial sequences and corresponding metadata using the following commands:
mkdir tutorial-data/
wget -O tutorial-data/context-metadata.tsv https://raw.githubusercontent.com/caporaso-lab/genome-sampler/r2020.8/snakemake/tutorial-data/context-metadata.tsv
wget -O tutorial-data/focal-metadata.tsv https://raw.githubusercontent.com/caporaso-lab/genome-sampler/r2020.8/snakemake/tutorial-data/focal-metadata.tsv
wget -O tutorial-data/context-seqs.fasta https://raw.githubusercontent.com/caporaso-lab/genome-sampler/r2020.8/snakemake/tutorial-data/context-seqs.fasta
wget -O tutorial-data/focal-seqs.fasta https://raw.githubusercontent.com/caporaso-lab/genome-sampler/r2020.8/snakemake/tutorial-data/focal-seqs.fasta
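Before moving on, it can help to confirm that all four downloads succeeded. The following is a minimal sketch and not part of genome-sampler itself: the helper name `check_tutorial_data` is hypothetical, and it only checks that each expected file exists and is non-empty.

```shell
# Hypothetical helper: report whether each tutorial file exists and is non-empty.
check_tutorial_data() {
    dir="${1:-tutorial-data}"
    status=0
    for name in context-metadata.tsv focal-metadata.tsv \
                context-seqs.fasta focal-seqs.fasta; do
        if [ -s "$dir/$name" ]; then
            echo "ok: $dir/$name"
        else
            echo "missing or empty: $dir/$name"
            status=1
        fi
    done
    # Non-zero return value signals that at least one file needs attention.
    return "$status"
}

check_tutorial_data tutorial-data || echo "Re-run the corresponding wget commands."
```

A file reported as missing or empty usually points to an interrupted download or a mistyped URL.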
Using genome-sampler (Snakemake workflow)
The full genome-sampler workflow can be run using Snakemake (Köster & Rahmann, 2012).
Begin by installing Snakemake.
Download the Snakefile and associated config file using wget as follows:
wget -O Snakefile https://raw.githubusercontent.com/caporaso-lab/genome-sampler/r2020.8/snakemake/Snakefile
wget -O config.yaml https://raw.githubusercontent.com/caporaso-lab/genome-sampler/r2020.8/snakemake/config.yaml
Place the resulting Snakefile in the same folder as the sequence and metadata files that you downloaded above. Then run:
snakemake
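If you would like to preview the run first, Snakemake's `-n` option performs a dry run that lists the planned jobs without executing them. The sketch below assumes the Snakefile is in the current directory. The helper name `preview_workflow` is hypothetical, and the `--cores 1` value is an assumption, added because some Snakemake releases require an explicit core count.

```shell
# Hypothetical helper: dry-run the workflow from the current directory.
preview_workflow() {
    if [ ! -f Snakefile ]; then
        echo "no Snakefile in $(pwd); download it first"
        return 1
    fi
    if ! command -v snakemake >/dev/null 2>&1; then
        echo "snakemake not found on PATH"
        return 1
    fi
    # -n: print the planned jobs without running them.
    snakemake -n --cores 1
}

preview_workflow || echo "dry run did not complete"
```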
When this workflow completes, there will be two primary output files that you’ll use:
- sequences.fasta will contain your subsampled context sequences and your focal sequences. You should use this file for downstream analyses, such as alignment and phylogenetic analyses.
- selection-summary.qzv will provide a summary of the sampling run. You can view this file using QIIME 2 View.
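As a quick sanity check on the outputs, the sketch below counts the records in sequences.fasta and lists the contents of selection-summary.qzv. It assumes both files are in the current directory and relies on the fact that QIIME 2 .qzv files are zip archives. The helper name `count_fasta_records` is hypothetical.

```shell
# Hypothetical helper: count FASTA records (one '>' header line per sequence).
count_fasta_records() {
    grep -c '^>' "$1"
}

if [ -s sequences.fasta ]; then
    echo "sequences.fasta: $(count_fasta_records sequences.fasta) sequences"
fi

# A .qzv file is a zip archive, so its contents can be listed without QIIME 2.
if [ -s selection-summary.qzv ] && command -v unzip >/dev/null 2>&1; then
    unzip -l selection-summary.qzv
fi
```

The reported count should equal the number of focal sequences plus the number of context sequences retained by sampling.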
You should now be able to move on to analysis of your own data.
See Adapting Snakemake workflow for application to your own data to learn about what changes you might want to make to your Snakefile before running your own analysis.
Adapting Snakemake workflow for application to your own data
The Snakemake workflow presented above is a good starting point for your own analyses.
There are typically a few things to do to adapt the workflow for your own data.
These changes are made in the config.yaml file, which should be found alongside the Snakefile.
- Modify input filepaths as needed. The input filepaths listed correspond to the names of the files provided for the tutorial. You can either name your files using those names, or update the input filepath values.
- Modify output filepaths if you’d like these to be different from the ones used in the tutorial.
- Modify longitudinal, neighbor, and diversity sampling parameters as desired.
If you end up experimenting with different values for these parameters, which we encourage, we would love to hear about your findings.
Be aware that increasing the *_percent_id parameters will increase the runtime of your analysis, and decreasing those values will decrease the runtime of your analysis.
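One way to keep track of parameter experiments is to save a copy of the tutorial configuration and locate the relevant settings before editing. This is a rough sketch: the helper name `list_percent_id_params` and the backup name `config.yaml.orig` are hypothetical, and the search simply looks for lines containing `_percent_id` rather than parsing the YAML.

```shell
# Hypothetical helper: show numbered lines that set *_percent_id parameters.
list_percent_id_params() {
    grep -n '_percent_id' "${1:-config.yaml}"
}

if [ -f config.yaml ]; then
    # Keep the tutorial defaults for later comparison (never overwrite a prior copy).
    [ -e config.yaml.orig ] || cp config.yaml config.yaml.orig
    list_percent_id_params config.yaml || echo "no *_percent_id parameters found"
fi
```

After editing, `diff config.yaml.orig config.yaml` shows exactly which values changed.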
🐍
- Köster, J., & Rahmann, S. (2012). Snakemake–a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522.