OTU Clustering: A window to analyse the uncultured microbial world
Keywords:
16S rDNA, OTUs, Uclust, SUMACLUST, SortMeRNA, USEARCH, taxonomic profiling
Abstract
Clustering is a technique for handling large amounts of data by partitioning the data into groups based on shared attributes. It has many applications across science and technology. In genomics and metagenomics it is an important tool for taxonomic profiling of the microbial world: 16S rDNA amplicon reads are grouped into clusters called Operational Taxonomic Units (OTUs). With Next Generation Sequencing (NGS) tools and clustering, scientists can readily survey microbial diversity in different environments without culturing the microbes. Assigning 16S rDNA sequences to OTUs is the main task of metagenomics algorithms and also the main bottleneck in analysing microbial communities. Taxonomic profiling of 16S rDNA is an important step in a metagenomic analysis pipeline. Several OTU clustering algorithms cluster 16S rDNA amplicon reads into OTUs, and each uses a specific clustering technique to do so. Some of the most widely used are Uclust, Swarm, SUMACLUST, SortMeRNA, and USEARCH. In this paper, we first give a brief overview of the major clustering techniques and their types. We then provide a comprehensive overview of OTU clustering algorithms.
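To make the idea of grouping reads into OTUs concrete, the following is a minimal Python sketch of greedy centroid-based clustering, one general family of approaches. It is an illustration under stated assumptions, not the implementation of any tool named above: the function names `identity` and `greedy_otu_clustering` are hypothetical, the default 0.97 identity threshold is an assumed conventional value that does not come from this abstract, and identity is computed as a simple position-by-position match on equal-length reads, whereas real tools typically derive it from pairwise alignments.

```python
from typing import List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Assumption: reads are already aligned and trimmed to the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this sketch")
    if not a:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otu_clustering(
    reads: List[str], threshold: float = 0.97
) -> List[Tuple[str, List[str]]]:
    """Greedy centroid-based clustering (illustrative only).

    Each read joins the first existing centroid it matches at or above the
    threshold; a read that matches no centroid starts a new OTU and becomes
    that OTU's centroid. The result depends on the input order.
    """
    otus: List[Tuple[str, List[str]]] = []
    for read in reads:
        for centroid, members in otus:
            if identity(read, centroid) >= threshold:
                members.append(read)
                break
        else:
            # No centroid was close enough: open a new OTU.
            otus.append((read, [read]))
    return otus


if __name__ == "__main__":
    # Toy reads, far shorter than real 16S amplicons.
    toy_reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
    for centroid, members in greedy_otu_clustering(toy_reads, threshold=0.9):
        print(centroid, len(members))
```

Because the assignment is greedy, the order of the input reads affects which sequences become centroids; sorting reads (for example by abundance) before clustering is one common way to make that choice less arbitrary, though this sketch leaves the ordering to the caller.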
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.