Optimizing Real-Time Data Pipelines for Machine Learning: A Comparative Study of Stream Processing Architectures

Raveendra Reddy Pasala; Mohan Raja Pulicharla; Varsha Premani

doi:10.30574/wjarr.2024.23.3.2818

Raveendra Reddy Pasala ¹,*, Mohan Raja Pulicharla ² and Varsha Premani ³

¹ Department of Computer Science, India.

² Monad University, India.

Review Article

World Journal of Advanced Research and Reviews, 2024, 23(03), 1653–1660

DOI url: https://doi.org/10.30574/wjarr.2024.23.3.2818

Publication history:

Received on 04 August 2024; revised on 11 September 2024; accepted on 13 September 2024

Abstract:

In the era of big data and real-time analytics, optimizing data pipelines for machine learning is critical for timely and accurate insights. This study analyzes the performance and scalability of Apache Kafka Streams, Apache Flink, and Apache Pulsar in real-time machine learning applications. Despite the wide use of these technologies, a comprehensive comparative analysis of their efficiency in practical scenarios is lacking. This research addresses that gap by providing a detailed comparison of these frameworks, focusing on latency, throughput, and resource utilization.

We conducted benchmarks and experiments to evaluate each framework's performance in handling high-throughput data, delivering real-time predictions, and managing resource utilization. Our results show that Apache Flink achieves 25% lower end-to-end latency than Kafka Streams in high-throughput scenarios. Apache Pulsar excels in scalability, handling up to 1.5 million messages per second, while Kafka Streams shows 15% higher memory usage.

These findings highlight the strengths and limitations of each framework. Kafka Streams integrates well with Kafka's messaging system but may have higher latency under heavy loads. Flink offers superior low-latency and high-throughput performance, making it suitable for complex tasks. Pulsar's advanced messaging features and scalability are promising for large-scale applications, though it requires careful tuning. This comparative analysis provides practical insights for selecting the optimal stream processing framework for machine learning pipelines.

Keywords:

Real-time ML pipelines; Kafka Streams performance; Flink vs Kafka latency; High-throughput stream processing; Pulsar scalability ML; Stream processing comparison
