Skip to article frontmatterSkip to article content

Step-by-step tutorial

This document illustrates how to use genome-sampler on a small tutorial data set using step-by-step instructions. This gives you complete control over the analysis, but is more complex to run relative to using the genome-sampler Snakemake workflow as illustrated in Snakemake tutorial.

Download tutorial data

The tutorial data set used here is intended for educational purposes only. If you’re interested in using these sequences for other analyses, we recommend starting with sequence repositories such as GISAID or NCBI Genbank, which have much more recent versions.

Download the tutorial sequences and corresponding metadata using the following commands:

[Command Line]
[Python API]
[R API]
[View Source]
wget -O 'context-metadata.tsv' \
  'https://genome-sampler.readthedocs.io/en/latest/data/tutorial/context-metadata.tsv'

wget -O 'focal-metadata.tsv' \
  'https://genome-sampler.readthedocs.io/en/latest/data/tutorial/focal-metadata.tsv'
[Command Line]
[Python API]
[R API]
[View Source]
wget -O 'context-seqs-raw.fasta' \
  'https://genome-sampler.readthedocs.io/en/latest/data/tutorial/context-seqs-raw.fasta'

wget -O 'focal-seqs-raw.fasta' \
  'https://genome-sampler.readthedocs.io/en/latest/data/tutorial/focal-seqs-raw.fasta'

Importing

You’ll begin the workflow by importing the fasta files into QIIME 2 Artifacts. QIIME 2 Artifacts are structured zip files which will contain the data (still in fasta format), but also some QIIME 2-specific information that (among other things) automates your method recording so you don’t have to do that.

[Command Line]
[Python API]
[R API]
[View Source]
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-format GISAIDDNAFASTAFormat \
  --input-path context-seqs-raw.fasta \
  --output-path context-seqs.qza
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-format GISAIDDNAFASTAFormat \
  --input-path focal-seqs-raw.fasta \
  --output-path focal-seqs.qza

If you’re obtaining context sequences from a public repository, you may encounter context sequences that don’t have associated metadata records. Steps in this workflow that require context sequence metadata would fail as a result. At this stage we therefore filter any context sequences that don’t have metadata records.

[Command Line]
[Python API]
[R API]
[View Source]
qiime feature-table filter-seqs \
  --i-data context-seqs.qza \
  --m-metadata-file context-metadata.tsv \
  --o-filtered-data context-seqs-w-metadata.qza

Quality filtering

Next, we’ll apply a quality filter to the sequence data. Technically this is an optional step for both the context and focal sequences, but in practice if you obtain your context sequences from a public repository you should consider the context sequence filtering to be essential. You can use this to remove sequences based on their length (if they’re too short or too long), and based on the proportion of ambiguous (e.g., N) nucleotide characters that they contain. Here we’ll apply this both to the context sequences and the focal sequences. We’ll remove sequences with a proportion of 1% or more ambiguous characters.

[Command Line]
[Python API]
[R API]
[View Source]
qiime genome-sampler filter-seqs \
  --i-sequences context-seqs-w-metadata.qza \
  --p-max-proportion-ambiguous 0.01 \
  --o-filtered-sequences filtered-context-seqs.qza
qiime genome-sampler filter-seqs \
  --i-sequences focal-seqs.qza \
  --p-max-proportion-ambiguous 0.01 \
  --o-filtered-sequences filtered-focal-seqs.qza

At any time, you could get some summary information about your sequences using the following command:

[Command Line]
[Python API]
[R API]
[View Source]
qiime feature-table tabulate-seqs \
  --i-data filtered-focal-seqs.qza \
  --o-visualization filtered-focal-seqs-summary.qzv

That command will create a QIIME 2 visualization, which most frequently would be viewed using QIIME 2 View. Try viewing that file and finding information such as the number of sequences present in this file and the median length of the sequences. (Your data is not uploaded to a server when you visit QIIME 2 View, so you don’t need to be concerned about exposing sensitive research data.) For additional information on how to view QIIME 2 visualizations, see here.

Sampling genomes

We’re now ready to start sampling our data, and we’ll do this in three steps. These subsampling steps are independent, so can be run in any order.

Sampling across time ⏱️

First, we’ll sample across time. By default, this will select seven genomes per seven day period. The file that gets generated as a result here isn’t something that is very useful to view directly, but we’ll use it in a few minutes to see how many sequences were retained at this step.

[Command Line]
[Python API]
[R API]
[View Source]
qiime genome-sampler sample-longitudinal \
  --i-context-seqs filtered-context-seqs.qza \
  --m-dates-file context-metadata.tsv \
  --m-dates-column date \
  --o-selection date-selection.qza

Sampling across biological diversity 🌲

Next, we’ll sample across viral diversity. This will cluster sequences at a percent identity threshold of 99.95%, and select the centroid sequence from each cluster to include in downstream analyses.

[Command Line]
[Python API]
[R API]
[View Source]
qiime genome-sampler sample-diversity \
  --i-context-seqs filtered-context-seqs.qza \
  --p-percent-id 0.9995 \
  --o-selection diversity-selection.qza

Sampling across location of isolation 🗺️

Last, we’ll sample near neighbors of the focal sequences from the context sequences. Of the near-neighbor sequences that are identified for each focal sequence (up to 10 by default), 3 sequences will be selected at random such that each location (defined in the context-metadata.tsv file) has an equal probability of being selected for inclusion in downstream analysis.

[Command Line]
[Python API]
[R API]
[View Source]
qiime genome-sampler sample-neighbors \
  --i-focal-seqs filtered-focal-seqs.qza \
  --i-context-seqs filtered-context-seqs.qza \
  --m-locale-file context-metadata.tsv \
  --m-locale-column location \
  --p-percent-id 0.9999 \
  --p-samples-per-cluster 3 \
  --o-selection neighbor-selection.qza

Combine and summarize samples

Now, we’ll combine the results of the three sampling approaches:

[Command Line]
[Python API]
[R API]
[View Source]
qiime genome-sampler combine-selections \
  --i-selections date-selection.qza diversity-selection.qza neighbor-selection.qza \
  --o-combined-selection master-selection.qza

And we’ll generate a summary of the full selection process:

[Command Line]
[Python API]
[R API]
[View Source]
qiime genome-sampler summarize-selections \
  --i-selections date-selection.qza diversity-selection.qza neighbor-selection.qza \
  --o-visualization selection-summary.qzv

The selection-summary.qzv file provides a summary of the sampling run. You can view this file using QIIME 2 View. Be sure to click the Provenance tab after loading the file on that page - that will provide full details on the workflow that was executed. How many sequences were retained by each sampling step?

Preparing sampled data for downstream applications

We’re now ready to start compiling our final data set. To do this, we’ll use the master-selection.qza file to select specific context sequences from the filtered-context-seqs.qza file that was generated earlier.

[Command Line]
[Python API]
[R API]
[View Source]
qiime feature-table filter-seqs \
  --i-data filtered-context-seqs.qza \
  --m-metadata-file master-selection.qza \
  --o-filtered-data subsampled-context-seqs.qza

Next, merge the resulting subsampled context sequences with the focal sequences to create the final QIIME 2 artifact.

[Command Line]
[Python API]
[R API]
[View Source]
qiime feature-table merge-seqs \
  --i-data subsampled-context-seqs.qza filtered-focal-seqs.qza \
  --o-merged-data sequences.qza

Exporting data

Optional: alignment and phylogenetic reconstruction with QIIME 2

QIIME 2 contains tools for preliminary sequence alignment and phylogenetic tree generation in the q2-alignment and q2-phylogeny plugins. QIIME 2 is not designed for high-accuracy phylogenetic analysis, but it can help you to generate and visualize initial alignments and trees.

These steps would take the sequences.qza file as input, so if you’re interested in using these you could postpone or skip the export step that you ran above.

If you’d like to learn more about this, refer to Downstream alignment and phylogenetics.