OTU Clustering: A window to analyse uncultured microbial world

Ashaq Hussain Bhat; Puniethaa Prabhu

Authors

Ashaq Hussain Bhat, Department of Biotechnology, K. S. Rangasamy College of Technology, Tiruchengode, India
Puniethaa Prabhu, Department of Biotechnology, K. S. Rangasamy College of Technology, Tiruchengode, India

Keywords:

16S rDNA, OTUs, Uclust, SUMACLUST, SortMeRNA, USEARCH, taxonomic profiling

Abstract

Clustering is a technique for handling large amounts of data by partitioning them into groups based on shared attributes. It has many applications across science and technology. In genomics and metagenomics it is an important tool for taxonomic profiling of the microbial world: 16S rDNA amplicon reads are grouped into clusters called Operational Taxonomic Units (OTUs). Together with Next Generation Sequencing (NGS), clustering has made it much easier for scientists to survey microbial diversity in different environments without culturing the microbes. Assigning 16S rDNA sequences to OTUs is the central task of metagenomic algorithms and also the main bottleneck in analysing microbial communities, which makes taxonomic profiling of 16S rDNA an important step in any metagenomic analysis pipeline. Several OTU clustering algorithms group 16S rDNA amplicon reads into OTUs, and each uses a specific clustering technique to do so. Among the most widely used are Uclust, Swarm, SUMACLUST, SortMeRNA and USEARCH. In this paper, we first give a brief overview of the major clustering techniques and their types. We then provide a comprehensive overview of OTU clustering algorithms.
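
To make the idea of grouping reads into OTUs concrete, the following is a minimal sketch of greedy centroid-based clustering, a general strategy used by several tools of this kind. It is not the implementation of any tool named above. The function names (`edit_distance`, `identity`, `greedy_otu_clustering`), the identity definition (one minus edit distance divided by the longer read length) and the toy reads are assumptions made for illustration only; production tools rely on k-mer heuristics and proper alignment, and a 97% threshold is a common convention rather than something fixed by this paper.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Simplified identity (an assumption): 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def greedy_otu_clustering(reads: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Assign each read to the first centroid it matches, else start a new OTU.

    The first read of each cluster acts as its centroid; results depend on
    the order of the input reads.
    """
    centroids: List[str] = []
    clusters: List[List[str]] = []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[idx].append(read)
                break
        else:
            # No existing centroid is similar enough: this read seeds a new OTU.
            centroids.append(read)
            clusters.append([read])
    return clusters


# Toy usage with hypothetical 20-base reads and a relaxed 90% threshold.
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",  # one mismatch against the first read
    "TTTTGGGGCCCCAAAATTTT",
    "TTTTGGGGCCCCAAAATTTA",  # one mismatch against the third read
]
otus = greedy_otu_clustering(reads, threshold=0.90)
print(len(otus))
```

Because the first read assigned to a cluster becomes its centroid, the outcome depends on input order, which is why tools following this strategy typically sort reads (for example by abundance or length) before clustering.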

