
sample-diversity

genome-sampler sample-diversity

Sample the context sequences down to a collection of divergent sequences. This is useful for retaining the diversity of the context sequences in a smaller data set and for downsampling abundant lineages.

Citations

Bolyen et al., 2020; Rognes et al., 2016

Inputs

context_seqs: FeatureData[Sequence]

The context sequences to be sampled from. [required]

Parameters

percent_id: Float % Range(0, 1, inclusive_end=True)

The percent identity threshold for clustering, expressed as a fraction between 0 and 1. Context sequences will be dereplicated such that no pair of retained sequences has a percent identity to one another that is this high. [required]

n_threads: Int % Range(1, None)

The number of threads to use for processing. [default: 1]
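
To make the effect of percent_id concrete, here is a minimal Python sketch of greedy dereplication. It is a toy illustration only, not the clustering this action actually performs: the helper names (`pairwise_identity`, `greedy_diversity_sample`), the position-by-position identity metric on equal-length sequences, and the reading of "this high" as "greater than or equal to the threshold" are all assumptions made for the example.

```python
from typing import Dict, List


def pairwise_identity(a: str, b: str) -> float:
    """Toy identity metric (an assumption for this sketch): the fraction of
    matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy metric assumes equal-length sequences")
    if not a:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_diversity_sample(seqs: Dict[str, str], percent_id: float) -> List[str]:
    """Hypothetical helper: keep a sequence only if its identity to every
    already-retained sequence is below percent_id."""
    if not 0.0 <= percent_id <= 1.0:
        raise ValueError("percent_id must be a fraction between 0 and 1")
    retained: List[str] = []
    for seq_id, seq in seqs.items():
        if all(pairwise_identity(seq, seqs[r]) < percent_id for r in retained):
            retained.append(seq_id)
    return retained


# Three near-identical sequences (an "abundant lineage") and one divergent one.
context_seqs = {
    "a1": "ACGTACGTAC",
    "a2": "ACGTACGTAT",  # identity 0.9 to a1
    "a3": "ACGTACGTAG",  # identity 0.9 to a1
    "b1": "TTGTACCTAC",  # identity 0.7 to a1
}

print(greedy_diversity_sample(context_seqs, percent_id=0.85))  # ['a1', 'b1']
```

In this toy data, a threshold of 0.85 collapses the three near-identical sequences to a single representative while the divergent sequence survives, which is the downsampling behavior described above. Raising the threshold to 1.0 would remove only exact duplicates under this metric.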

Outputs

selection: FeatureData[Selection]

The selected IDs (i.e., the diversity-sampled context sequences). [required]

References
  1. Bolyen, E., Dillon, M. R., Bokulich, N. A., Ladner, J. T., Larsen, B. B., Hepp, C. M., Lemmer, D., Sahl, J. W., Sanchez, A., Holdgraf, C., Sewell, C., Choudhury, A. G., Stachurski, J., McKay, M., Engelthaler, D. M., Worobey, M., Keim, P., & Caporaso, J. G. (2020). Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity. F1000Research, 9(657), 657. 10.12688/f1000research.24751.1
  2. Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. 10.7717/peerj.2584