Autonomous data engineering: Reinforcement learning-driven metadata management in cloud-native data ecosystems
Enterprise Infrastructure, Truist Financial Corporation, USA.
Research Article
World Journal of Advanced Research and Reviews, 2024, 24(03), 3568-3582
Publication history:
Received on 12 November 2024; revised on 21 December 2024; accepted on 28 December 2024
Abstract:
Managing metadata at scale in distributed Extract, Transform, Load (ETL) ecosystems presents significant challenges, including schema drift, source-target mapping inconsistencies, and error propagation across data pipelines. This paper introduces a novel reinforcement learning-based autonomous metadata management framework that dynamically adapts to schema evolution, optimizes source-target mapping configurations, and implements self-correcting mechanisms for data quality anomalies. The proposed system leverages deep Q-networks (DQN) and policy gradient methods to continuously learn from historical ingestion patterns, schema change events, and anomaly occurrences within modern cloud-native data platforms. An implementation using Snowflake Data Cloud, Databricks Unified Analytics Platform, and Amazon Web Services (AWS) storage services demonstrates the framework's effectiveness across heterogeneous data environments. Experimental validation conducted on Truist Financial Corporation's enterprise data lakes shows a 67% reduction in manual metadata correction efforts and a 40% improvement in data availability Service Level Agreements (SLAs), while maintaining 99.7% data accuracy across distributed data pipelines processing over 2.5 petabytes of financial data monthly.
Keywords:
Reinforcement Learning; Metadata Management; Data Engineering; Schema Evolution; Cloud Analytics; Automated Data Pipelines
Copyright information:
Copyright © 2024 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
