The Necessity of Exploratory Data Analysis: How are preprocessing activities beneficial to Data Analysts and Professional Researchers in Academia?

Ismail Olaniyi Muraina; Olayemi Muyideen Adesanya; Moses Adeolu Agoi; Solomon Onen Abam

Authors

Ismail Olaniyi Muraina, Department of Computer Science, Lagos State University of Education, Lagos, Nigeria
Olayemi Muyideen Adesanya, Department of Computer Science, Lagos State University of Education, Lagos, Nigeria
Moses Adeolu Agoi, Department of Computer Science, Lagos State University of Education, Lagos, Nigeria
Solomon Onen Abam, Department of Computer Science, Federal College of Education Technical, Ebonyi, Nigeria

Keywords:

Data Analysts, Exploratory Data Analysis, Dataset, Tools, Statistical Summary, Preprocessing

Abstract

Data analysis is used in all academic disciplines. Still, research has shown that some studies can appear ambiguous when the analyzed data is not sufficiently illustrated to identify trends, patterns, and other assumptions. These assumptions typically enable researchers to present the statistical summary using pertinent and self-explanatory graphical representations. Analysts want a method that helps them condense the dataset's critical characteristics for straightforward interpretation and presentation to the audience. In addition to presenting the impact of preprocessing activities in ensuring an error-free dataset before the actual analysis is done, this study explains how to investigate a dataset adequately so that a clean dataset is available for accurate analysis and interpretation. The most popular preprocessing procedures are listed, including the handling of missing values, the treatment of outliers, and variable transformation. The study used a descriptive survey design and gathered data from respondents with a questionnaire administered through Google Forms. The information was collected using criteria such as gender, amount of data analysis experience, institution type, and role within the academic community. Both face validity and construct validity methods were used to validate the instrument. Cronbach's alpha yielded a reliability coefficient of 0.88, indicating good reliability. Since the questionnaire was prepared and administered online using Google Forms, the data collection and collation process took only four days. Appropriate visualization software was used for the analysis. The results demonstrated that thoroughly exploring the data and removing any bias or outliers is the first step that any data analyst must take before starting a proper analysis process. The usage of some of the tools available for cleaning datasets was also outlined, and it was recommended that amateur analysts take the time to learn how to use them. Before beginning their final-year projects, final-year undergraduate and postgraduate students should be exposed to all exploratory data analysis methods.
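
The three preprocessing procedures named in the abstract (missing values, outliers, and variable transformation) can be illustrated with a minimal sketch. It is not the authors' procedure: the data are hypothetical, and median imputation, Tukey's 1.5 x IQR rule, and a log(1 + x) transformation are assumed here as common, simple choices among many.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def log_transform(values):
    """Apply log(1 + x) to compress right-skewed, non-negative values."""
    return [math.log1p(v) for v in values]


# Hypothetical responses: years of data analysis experience,
# with one missing answer (None) and one implausibly large entry.
raw = [2, 3, None, 4, 5, 3, 40]

filled = impute_median(raw)                # step 1: handle missing values
flags = iqr_outlier_flags(filled)          # step 2: detect outliers
cleaned = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
transformed = log_transform(cleaned)       # step 3: transform the variable

print("filled:     ", filled)
print("outliers:   ", [v for v, f in zip(filled, flags) if f])
print("transformed:", [round(v, 3) for v in transformed])
```

Dropping flagged values is only one option. Depending on the study, an analyst might instead inspect, cap, or keep them, and should record whichever choice was made.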
