genome-sampler sample-neighbors¶
Sample context sequences that are near-neighbors of the focal sequences, optionally sampling across locales if a locale column is provided. This helps avoid phylogenies in which the focal sequences appear monophyletic only because their close relatives were not sampled.
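As a rough illustration of what "near-neighbor" means here, the sketch below finds, for each focal sequence, the context sequences that match at or above a percent identity threshold, keeping at most max_accepts of them. It is a naive stand-in, not the plugin's implementation: the citation of Rognes et al. (2016) suggests the real search is delegated to VSEARCH, whereas this sketch assumes the sequences are already aligned to equal length and keeps the best-scoring hits. The function names `percent_identity` and `find_neighbors` are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned sequences.

    Assumption: both sequences have the same length (already aligned).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def find_neighbors(focal, context, percent_id, max_accepts=10):
    """Map each focal id to up to max_accepts neighboring context ids.

    A context sequence is a neighbor if it matches the focal sequence
    at greater than or equal to percent_id.
    """
    clusters = {}
    for focal_id, focal_seq in focal.items():
        hits = []
        for ctx_id, ctx_seq in context.items():
            pid = percent_identity(focal_seq, ctx_seq)
            if pid >= percent_id:
                hits.append((pid, ctx_id))
        # Best matches first; ties are broken by id so results are stable.
        hits.sort(key=lambda h: (-h[0], h[1]))
        clusters[focal_id] = [ctx_id for _, ctx_id in hits[:max_accepts]]
    return clusters


# Toy data (hypothetical ids and sequences).
focal = {"f1": "ACGTACGTAC"}
context = {
    "c1": "ACGTACGTAC",  # identical to f1
    "c2": "ACGTACGTAA",  # one mismatch in ten positions
    "c3": "TTTTACGTAC",  # three mismatches in ten positions
}
clusters = find_neighbors(focal, context, percent_id=0.9)
```

With these toy sequences, `c1` and `c2` meet the 0.9 threshold and `c3` does not, so the cluster for `f1` holds two neighbors. Lowering `max_accepts` to 1 would keep only the best match.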
Citations¶
Bolyen et al., 2020; Rognes et al., 2016
Inputs¶
- focal_seqs:
FeatureData[Sequence] The focal sequences. [required]
- context_seqs:
FeatureData[Sequence] The context sequences to be sampled from. [required]
Parameters¶
- percent_id:
Float%Range(0, 1, inclusive_end=True) The percent identity threshold for searching. If a context sequence matches a focal sequence at greater than or equal to this percent identity, the context sequence will be considered a neighbor of the focal sequence. [required]
- samples_per_cluster:
Int%Range(1, None) The number of context sequences to sample per cluster, where clusters are the up to max_accepts context sequences that match at percent_id to a given focal sequence. [required]
- locale:
MetadataColumn[Categorical] The metadata column that contains locale data. If provided, sampling will be performed across locales. (While this was designed for locale sampling, any categorical metadata column could be provided.) Each occurrence of missing data will be treated as a unique locale/category. [optional]
- max_accepts:
Int%Range(1, None) The maximum number of context sequences that match a focal sequence at percent_id or higher that will be identified. Up to samples_per_cluster of these will be sampled. [default: 10]
- n_threads:
Int%Range(1, None) The number of threads to use for processing. [default: 1]
- seed:
Int%Range(0, None) Seed used for random number generators. [optional]
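The interaction between samples_per_cluster, locale, and seed can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's actual code: it assumes clusters have already been identified (a mapping from focal id to neighboring context ids), and it assumes "sampling across locales" means drawing one sequence per locale in turn until the quota is met. The names `sample_cluster` and `sample_clusters` are hypothetical.

```python
import random
from collections import defaultdict


def sample_cluster(cluster, samples_per_cluster, locale=None, rng=None):
    """Pick up to samples_per_cluster ids from one cluster.

    cluster: list of context ids.
    locale: optional dict mapping id -> locale (None means missing data).
    """
    if rng is None:
        rng = random.Random()
    if len(cluster) <= samples_per_cluster:
        return list(cluster)
    if locale is None:
        return rng.sample(cluster, samples_per_cluster)

    # Group ids by locale. Each missing value gets its own category,
    # mirroring the documented treatment of missing data.
    groups = defaultdict(list)
    for seq_id in cluster:
        value = locale.get(seq_id)
        key = ("missing", seq_id) if value is None else ("locale", value)
        groups[key].append(seq_id)

    pools = list(groups.values())
    for pool in pools:
        rng.shuffle(pool)
    rng.shuffle(pools)

    # Assumed strategy: round-robin over locales, one id per locale per pass.
    chosen = []
    while len(chosen) < samples_per_cluster:
        for pool in pools:
            if pool and len(chosen) < samples_per_cluster:
                chosen.append(pool.pop())
    return chosen


def sample_clusters(clusters, samples_per_cluster, locale=None, seed=None):
    """Return the set of selected context ids across all clusters."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    selected = set()
    for focal_id in sorted(clusters):
        picked = sample_cluster(clusters[focal_id], samples_per_cluster, locale, rng)
        selected.update(picked)
    return selected


# Toy data (hypothetical ids and locales).
clusters = {"f1": ["c1", "c2", "c3", "c4"]}
locale = {"c1": "USA", "c2": "USA", "c3": "USA", "c4": "Brazil"}
selection = sample_clusters(clusters, 2, locale, seed=42)
```

In this toy cluster three of the four neighbors share a locale. Under the round-robin assumption, sampling two sequences always includes `c4`, the only sequence from the second locale, whereas sampling without a locale column would often return two sequences from the same locale.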
Outputs¶
- selection:
FeatureData[Selection] The selected IDs (i.e., the subsampled neighbors). [required]
- Bolyen, E., Dillon, M. R., Bokulich, N. A., Ladner, J. T., Larsen, B. B., Hepp, C. M., Lemmer, D., Sahl, J. W., Sanchez, A., Holdgraf, C., Sewell, C., Choudhury, A. G., Stachurski, J., McKay, M., Engelthaler, D. M., Worobey, M., Keim, P., & Caporaso, J. G. (2020). Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity. F1000Research, 9, 657. 10.12688/f1000research.24751.1
- Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. 10.7717/peerj.2584