genome-sampler sample-diversity
Sample the context sequences down to a collection of divergent sequences. This is useful for retaining the diversity of the context sequences in a smaller data set and for downsampling abundant lineages.
Citations
Bolyen et al., 2020; Rognes et al., 2016
Inputs
- context_seqs:
FeatureData[Sequence] The context sequences to be sampled from. [required]
Parameters
- percent_id:
Float%Range(0, 1, inclusive_end=True) The percent identity threshold for clustering, expressed as a fraction between 0 and 1. Context sequences will be dereplicated such that no pair of retained sequences has a percent identity to one another at or above this threshold. [required]
- n_threads:
Int%Range(1, None) The number of threads to use for processing. [default: 1]
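To make the effect of percent_id concrete, here is a minimal Python sketch of threshold-based dereplication. It is an illustration only, not the plugin's implementation: the function names (`percent_identity`, `sample_diversity_sketch`) are hypothetical, the identity metric assumes pre-aligned sequences of equal length, and the greedy first-come ordering is an assumption made for simplicity. The real action presumably relies on VSEARCH (cited above), which computes identity from pairwise alignments and may order and cluster sequences differently.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Toy metric (assumption): real tools compute identity from an alignment.
    """
    if len(a) != len(b):
        raise ValueError("this toy metric expects equal-length sequences")
    if not a:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def sample_diversity_sketch(seqs: dict[str, str], percent_id: float) -> list[str]:
    """Greedily keep ids whose identity to every kept sequence is below percent_id."""
    if not 0 <= percent_id <= 1:
        raise ValueError("percent_id must be between 0 and 1")
    selected: list[str] = []
    for seq_id, seq in seqs.items():
        # Keep a sequence only if it is sufficiently different from all kept ones.
        if all(percent_identity(seq, seqs[kept]) < percent_id for kept in selected):
            selected.append(seq_id)
    return selected


# Hypothetical context sequences: three near-identical ones and one divergent one.
context = {
    "a1": "ACGTACGTAC",
    "a2": "ACGTACGTAC",  # identical to a1
    "a3": "ACGTACGTAT",  # differs from a1 at one of ten positions
    "b1": "TTTTACGGGG",  # divergent from the others
}

selected = sample_diversity_sketch(context, percent_id=0.9)
print(selected)
```

In this toy data set the abundant "a" lineage collapses to a single representative while the divergent sequence is retained, which is the downsampling behavior described above. Raising percent_id toward 1 keeps more near-identical sequences; lowering it keeps fewer.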
Outputs
- selection:
FeatureData[Selection] The selected ids (i.e., the diversity-sampled context sequences). [required]
- Bolyen, E., Dillon, M. R., Bokulich, N. A., Ladner, J. T., Larsen, B. B., Hepp, C. M., Lemmer, D., Sahl, J. W., Sanchez, A., Holdgraf, C., Sewell, C., Choudhury, A. G., Stachurski, J., McKay, M., Engelthaler, D. M., Worobey, M., Keim, P., & Caporaso, J. G. (2020). Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity. F1000Research, 9, 657. 10.12688/f1000research.24751.1
- Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. 10.7717/peerj.2584