Improving accuracy of missing data imputation in data mining

Nzar A. Ali; Zhyan M. Omer

doi:10.24017/science.2017.3.30

Authors

Nzar A. Ali, Computer Dept., Cihan University, Sulaimani, Iraq
Zhyan M. Omer, Statistics and Informatics, University of Sulaimani, Sulaimani, Iraq

Abstract

Raw data in the real world is dirty. Every large data repository contains various types of anomalous values that influence the results of analysis. In data mining, good models usually need good data, yet real-world databases are not always clean: they include noise, incomplete data, duplicate records, inconsistent data and missing values. Missing data is a common drawback in many real-world data sets. In this paper, we propose an algorithm that improves the imputation step of the MIGEC algorithm for dealing with missing values. We apply grey relational analysis (GRA) to attribute values instead of instance values. Missing values are initially filled by mean imputation and then estimated by our proposed algorithm (PA); each estimated value is then used as a complete value when imputing the next missing value. We compare our proposed algorithm with several other algorithms, namely MMS, HDI, KNNMI, FCMOCS, CRI, CMI, NIIA and MIGEC, under different missingness mechanisms. Experimental results demonstrate that the proposed algorithm yields lower RMSE values than the other algorithms under all missingness mechanisms.
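The building blocks named in the abstract (mean imputation as a starting point, a grey relational comparison between attributes, and RMSE as the evaluation measure) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names are hypothetical, the grey relational grade follows the standard Deng formulation with a distinguishing coefficient rho = 0.5 (an assumed default), and the min/max differences are taken over a single pair of sequences rather than over all comparison sequences. The step that turns these grades into a refined imputed value is not specified in the abstract and is therefore left out.

```python
import math


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def grey_relational_grade(reference, candidate, rho=0.5):
    """Grey relational grade between two equal-length numeric sequences.

    Simplification: min and max differences are computed for this pair only.
    Sequences are assumed to be on comparable scales (normalise beforehand).
    """
    deltas = [abs(a - b) for a, b in zip(reference, candidate)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0:
        return 1.0  # identical sequences are maximally related
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    return sum(coeffs) / len(coeffs)


def rmse(true_vals, imputed_vals):
    """Root mean squared error between true and imputed values."""
    squared = [(t - p) ** 2 for t, p in zip(true_vals, imputed_vals)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: rows are instances, columns are attributes.
data = [[1.0, 2.0], [2.0, None], [3.0, 6.0]]
filled = mean_impute(data)

# Compare attributes (columns), not instances (rows).
columns = list(zip(*filled))
grade = grey_relational_grade(columns[0], columns[1])
```

Transposing the filled table before calling `grey_relational_grade` is what makes the comparison attribute-wise rather than instance-wise in this sketch. In an evaluation, `rmse` would be applied to the artificially removed true values and their imputed counterparts.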

Keywords:

Data mining; Missing value; Data preprocessing
