Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java

Hoger Khayrolla Omar; Alaa Khalil Jumaa

doi:10.24017/science.2019.1.2

Authors

Hoger Khayrolla Omar, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani | Kirkuk University, Kirkuk, Iraq
Alaa Khalil Jumaa, Database Technology Department, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Iraq

DOI:

https://doi.org/10.24017/science.2019.1.2

Keywords:

Big data, Data analysis, Apache Spark, Hadoop HDFS, Machine learning, Spark MLlib, Resilient Distributed Datasets (RDD).

Abstract

With the technology revolution, big data has become a defining phenomenon of the decade, and it has a significant impact on trends in applied science. Exploring capable big data tools is therefore a present necessity. Hadoop is a good technology for analyzing big data, but it is slow because the job result of each phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed and established as a practical model for analyzing big data, with an innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data handling, and more. This paper compares the time performance of Scala and Java in Apache Spark MLlib. Many tests were carried out on supervised and unsupervised machine learning methods using big datasets. The datasets were loaded both from Hadoop HDFS and from the local disk, to identify the pros and cons of each approach and to find the loading configuration that gives the best execution. The results showed that Scala performs about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyze big data with the more suitable programming language and, as a consequence, gain better performance.
