In-Depth Analysis of Medical Dataset Mining: A Comparative Analysis on a Diabetes Dataset Before and After Preprocessing


Most healthcare organizations and medical research institutions store their patients' data digitally for future reference and for treatment planning. These heterogeneous medical datasets are difficult to analyze because of their complexity and volume, and missing values and noise make mining them a tedious task. Efficient classification of medical datasets remains a major data mining problem. Diagnosis, disease prediction, and the precision of results can be improved if relationships and patterns are extracted efficiently from these complex datasets. This paper analyzes several major classification algorithms, namely C4.5 (J48), SMO, Naïve Bayes, KNN, and Random Forest, and compares their performance using WEKA. Performance is evaluated in terms of accuracy, sensitivity, specificity, and error rate. The medical datasets used in this study are the Heart-Statlog dataset, which holds data related to heart disease, and the Pima Diabetes dataset, which holds data related to diabetes. This study contributes to identifying the most suitable algorithm for classifying medical data and also shows the importance of preprocessing in improving classification performance. The performance of the algorithms is compared through graphical representation of the results.

Keywords: Data Mining, Health Care, Classification Algorithms, Accuracy, Sensitivity, Specificity, Error Rate
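
The four evaluation measures named in the abstract can be derived from a binary confusion matrix. The following is a minimal Python sketch of the standard definitions, not the WEKA workflow used in the study; the function name `evaluation_metrics`, the choice of "disease present" as the positive class, and the example counts are illustrative assumptions and are not results from the paper.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification measures from confusion-matrix counts.

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives, and false negatives. The positive class is assumed
    to be "disease present" (an assumption for this sketch).
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return {
        # Fraction of all instances classified correctly.
        "accuracy": (tp + tn) / total,
        # True positive rate: diseased patients correctly identified.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # True negative rate: healthy patients correctly identified.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Fraction of all instances misclassified (1 - accuracy).
        "error_rate": (fp + fn) / total,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
example = evaluation_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Reporting sensitivity and specificity alongside accuracy matters for medical data because class imbalance can make accuracy alone misleading.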
