Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java

Hoger Khayrolla Omar; Alaa Khalil Jumaa

doi:10.24017/science.2019.1.2

Authors

Hoger Khayrolla Omar, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani | Kirkuk University, Kirkuk, Iraq
Alaa Khalil Jumaa, Database Technology Department, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Iraq

Abstract

With the ongoing technology revolution, big data has become a defining phenomenon of the decade, and it has a significant impact on trends in applied science. Finding a capable big data tool is therefore a pressing need. Hadoop is a good technology for analyzing big data, but it is slow because the job result of each phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed and established as a practical model for analyzing big data, with its innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data handling, and more. This paper compares the execution-time performance of Scala and Java in Apache Spark MLlib. Many tests were carried out with supervised and unsupervised machine learning methods on large datasets. The datasets were loaded both from Hadoop HDFS and from the local disk, to identify the pros and cons of each approach and to find the best way of reading or loading a dataset for the fastest execution. The results show that Scala performs about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyze big data with the more suitable programming language and, as a consequence, to gain better performance.

Keywords

Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD).
