
sample-neighbors

genome-sampler sample-neighbors

Sample context sequences that are near neighbors of the focal sequences, sampling across locales if a locale column is provided. This helps avoid apparent monophyly of the focal sequences.

Citations

Bolyen et al., 2020; Rognes et al., 2016

Inputs

focal_seqs: FeatureData[Sequence]

The focal sequences. [required]

context_seqs: FeatureData[Sequence]

The context sequences to be sampled from. [required]

Parameters

percent_id: Float % Range(0, 1, inclusive_end=True)

The percent identity threshold for searching, expressed as a fraction between 0 and 1 (for example, 0.99 for 99% identity). If a context sequence matches a focal sequence at greater than or equal to this percent identity, the context sequence will be considered a neighbor of the focal sequence. [required]
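To make the threshold concrete, the following is a minimal sketch of the neighbor test, not the plugin's implementation. It assumes two sequences that are already aligned to the same length and uses a simple fraction-of-matching-positions definition of identity. The function names `fraction_identity` and `is_neighbor` are hypothetical, and the identity definition used by the search tool cited above (VSEARCH) may differ.

```python
def fraction_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length aligned sequences match.

    Assumption: a simplified identity definition for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def is_neighbor(focal: str, context: str, percent_id: float) -> bool:
    """True when the context sequence meets or exceeds the threshold."""
    if not 0 <= percent_id <= 1:
        raise ValueError("percent_id must be in the range [0, 1]")
    # The comparison is inclusive: a match at exactly percent_id counts.
    return fraction_identity(focal, context) >= percent_id
```

Note that the comparison is inclusive, mirroring the "greater than or equal to" wording of the parameter description.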

samples_per_cluster: Int % Range(1, None)

The number of context sequences to sample per cluster, where a cluster is the set of up to max_accepts context sequences that match a given focal sequence at percent_id or higher. [required]

locale: MetadataColumn[Categorical]

The metadata column that contains locale data. If provided, sampling will be performed across locales. (While this was designed for locale sampling, any categorical metadata column could be provided.) Each occurrence of missing data will be treated as a unique locale/category. [optional]
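One way to picture sampling across locales is sketched below. This is an illustration under stated assumptions, not the plugin's code: the round-robin strategy, the function name `sample_across_locales`, and the use of `None` to represent missing data are all hypothetical. The one behavior taken directly from the description above is that every sequence with missing locale data is placed in its own category.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def sample_across_locales(cluster: List[str],
                          locales: Dict[str, Optional[str]],
                          n: int,
                          seed: Optional[int] = None) -> List[str]:
    """Pick up to n ids from a cluster, spreading picks over locales.

    Assumption: a round-robin over locale groups stands in for whatever
    strategy the plugin actually uses.
    """
    rng = random.Random(seed)
    groups_by_key = defaultdict(list)
    for seq_id in cluster:
        locale = locales.get(seq_id)
        # Missing data: each occurrence becomes its own unique category.
        key = ("missing", seq_id) if locale is None else ("locale", locale)
        groups_by_key[key].append(seq_id)

    groups = list(groups_by_key.values())
    for group in groups:
        rng.shuffle(group)
    rng.shuffle(groups)

    selected: List[str] = []
    # Take one id per locale per pass until n ids are chosen or none remain.
    while len(selected) < n and any(groups):
        for group in groups:
            if group and len(selected) < n:
                selected.append(group.pop())
    return selected
```

With a cluster of four sequences where two share a locale, one has a different locale, and one has no locale recorded, requesting three samples under this sketch always returns one sequence from each of the three categories.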

max_accepts: Int % Range(1, None)

The maximum number of context sequences that match a focal sequence at percent_id or higher that will be identified. Up to samples_per_cluster of these will be sampled. [default: 10]
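The interaction between max_accepts and samples_per_cluster can be sketched as follows. This is a simplified illustration rather than the plugin's implementation: the function name `sample_cluster` is hypothetical, and truncating a pre-computed hit list to the first max_accepts entries is an assumption standing in for a search that stops once max_accepts matches have been identified.

```python
import random
from typing import List, Optional


def sample_cluster(hits: List[str],
                   samples_per_cluster: int,
                   max_accepts: int = 10,
                   seed: Optional[int] = None) -> List[str]:
    """Keep at most max_accepts hits, then sample up to samples_per_cluster."""
    if samples_per_cluster < 1 or max_accepts < 1:
        raise ValueError("samples_per_cluster and max_accepts must be >= 1")
    # Assumption: hits are the context sequence ids that matched one focal
    # sequence at percent_id or higher, in the order they were found.
    cluster = hits[:max_accepts]
    rng = random.Random(seed)
    # A cluster smaller than samples_per_cluster is returned in full.
    k = min(samples_per_cluster, len(cluster))
    return rng.sample(cluster, k)
```

In this sketch, passing the same seed yields the same selection, which is the role the seed parameter plays for reproducibility. Setting samples_per_cluster larger than max_accepts has no additional effect, because a cluster never holds more than max_accepts sequences.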

n_threads: Int % Range(1, None)

The number of threads to use for processing. [default: 1]

seed: Int % Range(0, None)

Seed used for random number generators. [optional]

Outputs

selection: FeatureData[Selection]

The selected IDs (i.e., the subsampled neighbors). [required]

References
  1. Bolyen, E., Dillon, M. R., Bokulich, N. A., Ladner, J. T., Larsen, B. B., Hepp, C. M., Lemmer, D., Sahl, J. W., Sanchez, A., Holdgraf, C., Sewell, C., Choudhury, A. G., Stachurski, J., McKay, M., Engelthaler, D. M., Worobey, M., Keim, P., & Caporaso, J. G. (2020). Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity. F1000Research, 9, 657. 10.12688/f1000research.24751.1
  2. Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. 10.7717/peerj.2584