Data science pipelines in lakehouse architectures: A scalable approach to big data analytics

Praveen Kumar Reddy Gujjala

doi:10.30574/wjarr.2022.16.3.1305

Independent Researcher, Cloud Computing, Columbus OH, USA.

Research Article

World Journal of Advanced Research and Reviews, 2022, 16(03), 1412-1425

DOI url: https://doi.org/10.30574/wjarr.2022.16.3.1305

Publication history

Received on 18 October 2022; revised on 19 December 2022; accepted on 26 December 2022

Abstract

The exponential growth of data generation across industries has necessitated the development of sophisticated architectures capable of handling diverse data types while maintaining analytical agility. This paper presents a comprehensive framework for implementing end-to-end data science pipelines within lakehouse architectures, bridging the gap between traditional data warehouses and data lakes. The proposed methodology leverages the unified storage and processing capabilities of lakehouse systems to create scalable, reproducible, and maintainable data science workflows that support both exploratory analytics and production machine learning deployments.

Our research introduces a novel modular pipeline framework that integrates data engineering and data science operations through a containerized microservices architecture. The framework incorporates metadata management systems for end-to-end data lineage tracking and implements cloud-native automation layers that dynamically scale computational resources based on workload demands. Through systematic evaluation of performance metrics and real-world case studies, we demonstrate significant improvements in pipeline execution time, resource utilization efficiency, and model deployment velocity compared to traditional architectures.

The lakehouse paradigm enables data scientists to perform complex analytics on raw, semi-structured, and structured data without the extract-transform-load bottlenecks that characterize conventional data warehouse approaches. By combining Apache Spark's distributed processing capabilities with Databricks' collaborative analytics platform and MLflow's model lifecycle management, our framework provides a comprehensive solution for enterprise-scale data science operations. Experimental results indicate up to a 60% reduction in time-to-insight and a 40% improvement in computational resource efficiency compared to legacy pipeline architectures.

Keywords

Lakehouse Architecture; Data Science Pipelines; Apache Spark; MLflow; Metadata Management; Cloud Computing; Machine Learning Operations

Download Article PDF

https://wjarr.com/sites/default/files/fulltext_pdf/WJARR-2022-1305.pdf

How to cite this article

Praveen Kumar Reddy Gujjala. Data science pipelines in lakehouse architectures: A scalable approach to big data analytics. World Journal of Advanced Research and Reviews, 2022, 16(3), 1412-1425. Article DOI: https://doi.org/10.30574/wjarr.2022.16.3.1305
