A comparative analysis of big data processing paradigms: MapReduce vs. Apache Spark

Sifat Ibtisum 1, *, Ehsan Bazgir 2, S M Atikur Rahman 3 and S. M. Saokat Hossain 4

1 Department of Computer Science, Missouri University of Science and Technology, Missouri, USA.

2 Department of Electrical Engineering, San Francisco Bay University, Fremont, CA 94539, USA.

3 Department of Industrial, Manufacturing and Systems Engineering, University of Texas at El Paso, TX 79968, USA.

4 Department of Computer Science, Jahangirnagar University, Dhaka, Bangladesh.

 
Review Article
World Journal of Advanced Research and Reviews, 2023, 20(01), 1089–1098
Article DOI: 10.30574/wjarr.2023.20.1.2174
 
Publication history: 
Received on 16 September 2023; revised on 24 October 2023; accepted on 27 October 2023
 
Abstract: 
This paper addresses a timely topic in data processing. Big data is a crucial aspect of modern computing, and the choice of processing framework can significantly affect performance and efficiency. The big data revolution has changed how organizations handle and value large databases. As data volumes grow quickly, effective and scalable data processing systems are essential. MapReduce and Apache Spark are two of the most popular big data processing frameworks. This study compares the two to determine their merits, shortcomings, and suitability for big data applications. Nearly a quintillion bytes of data are created daily, and approximately 90% of existing data was produced in the previous two years. This data comes from temperature sensors, social media, movies, photographs, transaction records (such as banking records), mobile phone conversations, GPS signals, and other sources. This article introduces the key big data technologies, compares them, and discusses their merits and downsides. Experiments on multiple data sets of varying sizes are run to validate and explain the study, and graphs show how one tool outperforms the others on a given data set. Big Data is data generated by the rapid usage of the internet, sensors, and heavy machinery, characterized by great volume, velocity, variety, and veracity. Numbers, photos, videos, and text are present in every sector. Because of the pace and amount of data generation, computing systems struggle to manage such large data. Owing to its size and complexity, the data is stored in a distributed file system. Large distributed file systems, which must be fault-tolerant, adaptable, and scalable, make complex data analysis difficult and time-consuming. The collection of big data is called ‘datafication’: big data is ‘datafied’ so that it can be used productively. Organising Big Data does not by itself make it valuable; its value depends on what we choose to do with it.
 
Keywords: 
SparkR; Spark Core; Apache Spark; MapReduce; GraphX
 