Evaluation of Stemming and Stop Word Techniques on Text Classification Problem
Keywords:
Machine Learning, Stemming, Feature Selection
Abstract
Nowadays a huge amount of information is available on the internet in electronic form. Analyzing this data can support intelligent decision making. Text categorization is an important and extensively studied problem in machine learning. Its basic phases are preprocessing, extracting relevant features against the features in a database, and finally assigning a set of documents to predefined categories. Most research in text categorization focuses on developing algorithms that optimize preprocessing techniques. This paper summarizes the impact of stop word removal and stemming on feature selection.
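The effect of these two preprocessing steps on the feature set can be illustrated with a minimal sketch. It is not the paper's method or experimental setup: the stop word list, the suffix list, the function names (`tokenize`, `toy_stem`, `vocabulary`) and the two sample documents are all assumptions made for illustration, and `toy_stem` is a crude suffix stripper, not the Porter algorithm.

```python
import re

# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes removed by the toy stemmer, checked in this order.
# This is NOT the Porter algorithm, only a stand-in for it.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def vocabulary(docs, remove_stop_words=False, stem=False):
    """Return the set of distinct terms (candidate features) across all documents."""
    vocab = set()
    for doc in docs:
        for token in tokenize(doc):
            if remove_stop_words and token in STOP_WORDS:
                continue
            vocab.add(toy_stem(token) if stem else token)
    return vocab


# Hypothetical sample documents.
docs = [
    "The classifier is sorting the documents",
    "Classifiers sorted a document in the categories",
]

raw = vocabulary(docs)
no_stop = vocabulary(docs, remove_stop_words=True)
stemmed = vocabulary(docs, remove_stop_words=True, stem=True)

print(len(raw), len(no_stop), len(stemmed))
```

On this toy input the vocabulary shrinks from 11 distinct terms to 7 after stop word removal and to 4 after stemming, because inflected forms such as "sorting" and "sorted" collapse into one feature. How much reduction occurs on a real corpus, and whether it helps classification, depends on the data and the stemmer used.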
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.