Evaluation of Stemming and Stop Word Techniques on Text Classification Problem

Dharmendra Sharma; Suresh Jain

Authors

Dharmendra Sharma, Mewar University, Chittorgarh, Rajasthan, India
Suresh Jain, Mewar University, Chittorgarh, Rajasthan, India

Keywords:

Machine Learning, Stemming, Feature Selection

Abstract

Nowadays a huge amount of information is available over the internet in electronic format. This large volume of data can be analyzed to support intelligent decision making. Text categorization is an important and extensively studied problem in machine learning. Its basic phases are preprocessing the text, extracting relevant features against the features in a database, and finally categorizing a set of documents into predefined categories. Most research on text categorization focuses on developing algorithms that optimize the preprocessing techniques used for categorization. In this paper we summarize the impact of stop word removal and stemming on feature selection.
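As a rough illustration of how stop word removal and stemming shrink the feature space before feature selection, the following Python sketch builds a vocabulary from two toy documents under three preprocessing settings. Everything in it is an assumption made for illustration: the stop word list, the suffix rules, the function names, and the sample sentences are hypothetical, and the naive suffix stripper is a stand-in for, not an implementation of, the Porter algorithm or any method evaluated in the paper.

```python
# Illustrative sketch only: the stop word list and suffix rules below are
# hypothetical and far simpler than a real stop list or the Porter stemmer.
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "for"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the (toy) stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(docs, use_stop_words=False, use_stemming=False):
    """Return the set of distinct features produced by the chosen preprocessing."""
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        if use_stop_words:
            tokens = remove_stop_words(tokens)
        if use_stemming:
            tokens = [naive_stem(t) for t in tokens]
        vocab.update(tokens)
    return vocab


docs = [
    "The classifier is learning categories of documents",
    "A classifier learned the category of a document",
]
raw = vocabulary(docs)
filtered = vocabulary(docs, use_stop_words=True)
stemmed = vocabulary(docs, use_stop_words=True, use_stemming=True)
print(len(raw), len(filtered), len(stemmed))
```

On this toy input the vocabulary shrinks at each step, which is the effect on the candidate feature set that the paper sets out to evaluate. The naive rules also leave "categori" and "category" as separate features, a reminder that the choice of stemmer matters.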

