Data science pipelines in lakehouse architectures: A scalable approach to big data analytics
Independent Researcher, Cloud Computing, Columbus OH, USA.
Research Article
World Journal of Advanced Research and Reviews, 2022, 16(03), 1412–1425
Publication history:
Received on 18 October 2022; revised on 19 December 2022; accepted on 26 December 2022
Abstract:
The exponential growth of data generation across industries has necessitated the development of sophisticated architectures capable of handling diverse data types while maintaining analytical agility. This paper presents a comprehensive framework for implementing end-to-end data science pipelines within lakehouse architectures, bridging the gap between traditional data warehouses and data lakes. The proposed methodology leverages the unified storage and processing capabilities of lakehouse systems to create scalable, reproducible, and maintainable data science workflows that support both exploratory analytics and production machine learning deployments.
Our research introduces a novel modular pipeline framework that integrates data engineering and data science operations through a containerized microservices architecture. The framework incorporates metadata management systems for data lineage tracking and implements cloud-native automation layers that dynamically scale computational resources based on workload demands. Through systematic evaluation of performance metrics and real-world case studies, we demonstrate significant improvements in pipeline execution time, resource utilization efficiency, and model deployment velocity compared to traditional architectures.
The lakehouse paradigm enables data scientists to perform complex analytics on raw, semi-structured, and structured data without the extract-transform-load bottlenecks that characterize conventional data warehouse approaches. By combining Apache Spark's distributed processing capabilities with Databricks' collaborative analytics platform and MLflow's model lifecycle management, our framework provides a comprehensive solution for enterprise-scale data science operations. Experimental results indicate up to a 60% reduction in time-to-insight and a 40% improvement in computational resource efficiency compared to legacy pipeline architectures.
Keywords:
Lakehouse Architecture; Data Science Pipelines; Apache Spark; MLflow; Metadata Management; Cloud Computing; Machine Learning Operations
Copyright information:
Copyright © 2022 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
