Improving Clustering Accuracy using Feature Extraction Method

T. SenthilSelvi; R. Parimala

Authors

T. SenthilSelvi Department of Computer Science, Periyar E.V.R College, Trichy-23, India
R. Parimala Department of Computer Science, Periyar E.V.R College, Trichy-23, India

Keywords:

Clustering, Euclidean Distance, Document frequency, Dimensionality reduction, Principal components

Abstract

Clustering groups documents that contain related information, which makes relevant information easier to locate. Clustering performance depends largely on the features used to represent the text documents. The first challenge is identifying significant term features that represent the original content while taking hidden knowledge into account. The second challenge is reducing data dimensionality without losing essential information. This paper proposes clustering techniques that use the feature extraction methods Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA) to improve clustering efficiency and quality. Documents are pre-processed, converted to a vector space model, and then clustered using the proposed algorithm. The goal of this work is to design a model for clustering text documents that improves clustering performance. These problems are discussed with empirical evidence, and experimental results show that the proposed method is effective for the text clustering task.
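
The pipeline outlined in the abstract (pre-processing, vector space model, feature extraction, clustering) can be illustrated with a minimal sketch. This is not the authors' algorithm: the toy corpus, the function names (tokenize, term_matrix, first_component_scores, kmeans_1d), the document-frequency threshold, the use of a single principal component found by power iteration, and plain k-means with Euclidean distance are all assumptions made for illustration. Only linear PCA is shown; KPCA is omitted.

```python
import math
import random
import re
from collections import Counter

def tokenize(text):
    # Simplified pre-processing (assumption): lowercase, keep alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())

def term_matrix(docs, min_df=2):
    # Vector space model: term-frequency rows, keeping only terms whose
    # document frequency is at least min_df (hypothetical threshold).
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter(t for c in counts for t in c)
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    rows = [[float(c[t]) for t in vocab] for c in counts]
    return rows, vocab

def first_component_scores(rows, iters=200, seed=0):
    # Linear PCA restricted to the first principal component, computed by
    # power iteration on X^T X of the mean-centered matrix X.
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Project every document onto the component (sign is arbitrary).
    return [sum(x * y for x, y in zip(row, v)) for row in centered]

def kmeans_1d(values, k=2, iters=20):
    # k-means on one-dimensional scores; |x - c| is the Euclidean distance
    # in one dimension. Centroids start evenly spaced between min and max.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centroids[c]))
                  for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels

# Toy corpus (invented for this sketch): two topics, three documents each.
docs = [
    "the cat chased the mouse",
    "a cat and a mouse play",
    "cat sees mouse",
    "stock market prices fell",
    "market prices and stock rise",
    "stock prices market report",
]
rows, vocab = term_matrix(docs, min_df=2)
scores = first_component_scores(rows)
labels = kmeans_1d(scores, k=2)
print(vocab)
print(labels)
```

On this toy corpus the two topics should fall into separate clusters, although which cluster receives label 0 depends on the arbitrary sign of the component. A realistic experiment would keep several components and would likely rely on an established library rather than this hand-written routine.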
