Enhancing distributed stream processing: novel keygrouping methods and their comparative analysis

Caulier, Camille

Files

Caulier_24701800_2024.pdf

UCLouvain restricted access
Adobe PDF
5.09 MB

Details

Supervisors: Riviere, Etienne ; Schmitz, Donatien ; Da Silva Veith, Alexandre
Faculty: Ecole polytechnique de Louvain
Degree label: Master [120] : ingénieur civil en informatique, à finalité spécialisée
Abstract: This master's thesis explores three new hybrid key grouping methods designed to optimise stream partitioning in distributed stream processing systems, implemented in the Apache Flink engine. Traditional key grouping methods, such as round-robin and hash partitioning, have played crucial roles in distributed computing but often fall short on skewed workloads. This research proposes key grouping techniques that aim to maximise throughput whilst keeping aggregation costs low on a skewed workload. Its secondary aim is to compare the different state-of-the-art key grouping methods. Implemented and tested within the Apache Flink framework, the first new method, hybrid setup, aims to surpass the limitations of round-robin, hash partitioning and other state-of-the-art key grouping methods by splitting the received keys into two groups, popular and non-popular, which are then routed to two different operators. Popular keys undergo round-robin key grouping while non-popular keys undergo hash partitioning. The second method, hashRoundRobin, builds on the logic of the first: all keys are sent to the same operator but are handled differently based on their popularity. Popular keys are managed using round-robin key grouping, while non-popular keys are grouped using hash partitioning. The third method, CAMRoundRobin, follows the same approach as the second but employs the cAM key grouping method proposed by Nikos et al. [14]. Comparative experiments conducted as part of this study evaluate the effectiveness of the proposed methods against conventional and state-of-the-art techniques by analysing throughput and aggregation costs. The findings reveal that the new key grouping methods improve partitioning efficiency and aggregation costs compared to their counterparts. These improvements are critical for applications requiring high throughput. The study also contributes practical insights into the varying effects that skew and window sizes can have on the key grouping methods cited.
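
The routing idea behind hashRoundRobin, as described in the abstract, can be illustrated with a minimal Python sketch. This is not the thesis's Flink implementation. The abstract does not say how popularity is detected, so the class name `HashRoundRobinSketch`, the count-based `popular_threshold`, and the use of CRC32 as the hash function are all assumptions made for illustration.

```python
import zlib
from collections import Counter


class HashRoundRobinSketch:
    """Illustrative router: round-robin for popular keys, hashing for the rest."""

    def __init__(self, num_workers, popular_threshold):
        self.num_workers = num_workers
        # Hypothetical popularity rule: a key becomes "popular" once it has
        # been seen more than popular_threshold times.
        self.popular_threshold = popular_threshold
        self.seen = Counter()   # occurrences observed per key
        self.next_worker = 0    # round-robin cursor shared by all popular keys

    def route(self, key):
        """Return the index of the worker that should receive this tuple."""
        self.seen[key] += 1
        if self.seen[key] > self.popular_threshold:
            # Popular key: spread its tuples evenly over all workers.
            worker = self.next_worker
            self.next_worker = (self.next_worker + 1) % self.num_workers
            return worker
        # Non-popular key: a deterministic hash always picks the same worker.
        return zlib.crc32(key.encode("utf-8")) % self.num_workers


if __name__ == "__main__":
    router = HashRoundRobinSketch(num_workers=4, popular_threshold=2)
    stream = ["hot"] * 10 + ["cold"]
    for key in stream:
        print(key, "->", router.route(key))
```

Because a popular key ends up on several workers, its partial results have to be merged downstream, which is the aggregation cost the abstract refers to. Non-popular keys stay on a single worker and need no such merge.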