So, I was going down a ton of distributed-computing rabbit holes instead of doing my homework, and I spotted something neat: someone mentioned that Kafka has tail latency problems. That sounded interesting, so I researched a bit more and found that this apparently applies to all (well, most) streaming tools.

What I found

So, the first thing I found was a case study from Allegro (basically the Amazon of Poland), where their Kafka median response times were single-digit milliseconds, but their p99 latency climbed up to 1 second and their p999 up to 3 seconds. Personally, I think it’s hilarious being that unlucky user waiting 3 seconds for what usually takes a fraction of that.

https://blog.allegro.tech/2024/03/kafka-performance-analysis.html
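To get a feel for how far apart the median and the high percentiles can sit, here’s a tiny sketch. The numbers are made up (a lognormal distribution, not Allegro’s actual data); the point is just that in a heavy-tailed distribution, p99 and p99.9 can be an order of magnitude above the median.

```python
import random

# Hypothetical latencies drawn from a heavy-tailed (lognormal) distribution.
# These parameters are invented for illustration, not taken from Allegro's data.
random.seed(42)
latencies_ms = sorted(random.lognormvariate(1.5, 1.0) for _ in range(100_000))

def percentile(sorted_vals, p):
    # Nearest-rank percentile: pick the value at the p-th fraction of the data.
    return sorted_vals[min(int(p * len(sorted_vals)), len(sorted_vals) - 1)]

p50 = percentile(latencies_ms, 0.50)
p99 = percentile(latencies_ms, 0.99)
p999 = percentile(latencies_ms, 0.999)
print(f"median={p50:.1f}ms  p99={p99:.1f}ms  p99.9={p999:.1f}ms")
```

Same service, same distribution, and yet the unlucky tail user waits many times longer than the typical one.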

But wait, it gets worse:

The more I looked, the more I realized this wasn’t just a Kafka issue. This is everyone’s problem.

Even Confluent’s own latency guide concedes as much: “The higher your target percentile, the more tuning is needed to either minimize or account for the worst-case behavior of your application.” Which basically means good luck fixing this without a massive effort.

https://www.confluent.io/blog/configure-kafka-to-minimize-latency/

Okay, but what about Apache Pulsar? One benchmark write-up claims that Pulsar’s 99th percentile latency is “within the range of 5 and 15 milliseconds,” yet at p99 with 1 partition, “the latency of Pulsar is 52.958 milliseconds, while the latency of Kafka is almost 4 times that of Pulsar, which is 201.701 milliseconds.”

https://segmentfault.com/a/1190000040977781/en

Lastly, I remembered that Google wrote an entire paper admitting that “It is challenging to keep the tail of the latency distribution low for interactive services as the size and complexity of the system scales up.”

https://research.google/pubs/the-tail-at-scale/

But why?

So yes, there is a ton of research showing awful tail latency at p99 and beyond, but the question is: why?

So first I found Marc Brooker’s analysis: if 1 service has a 1% chance of hitting tail latency, a chain of 10 services has roughly a 10% chance, and with 100 services it gets ugly fast.

As he notes, with N=10 you hit the tail around 10% of the time. The tail mode, which used to be quite rare, starts to dominate as N increases.

Another study found that with just 8 components in a call chain, there’s a 33% chance of hitting tail latency. The paper notes: “The trend is obvious, the more components in a call chain we have, the more likely the overall system response time is affected by tail latencies.”
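The math behind both claims is the same one-liner: if each call independently has probability p of landing in the tail, then the chance that at least one of N calls does is 1 − (1 − p)^N. A quick sketch (note: the 5%-per-hop figure in the last line is my assumption to reproduce the 33% number, not something the post states explicitly):

```python
def tail_hit_probability(n_calls: int, p_tail: float) -> float:
    # Chance that at least one of n_calls independent calls is a tail (slow) response.
    return 1 - (1 - p_tail) ** n_calls

# Brooker's p99 example: each service is slow 1% of the time.
print(tail_hit_probability(1, 0.01))    # 1%
print(tail_hit_probability(10, 0.01))   # ~9.6%
print(tail_hit_probability(100, 0.01))  # ~63%

# The 8-component / 33% figure works out if each hop is slow ~5% of the time:
print(tail_hit_probability(8, 0.05))    # ~34%
```

So “rare” per-service tail events stop being rare the moment you compose services.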

https://brooker.co.za/blog/2021/04/19/latency.html https://andrewpakhomov.com/posts/latency-tail-latency-and-response-time-in-distributed-systems/

So, the question again is why?

According to multiple sources, the usual suspects are network issues (which can’t really be avoided; looking at you, CAP), slow dependencies, garbage collection pauses, and resource contention in data centers.
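You can see how just one of these suspects poisons the tail with a toy simulation. The numbers here are invented (a 2% chance of landing on a node mid-GC-pause, with a 200 ms stall), but the shape is the point: the median barely notices, while p99 explodes.

```python
import random

random.seed(0)

def request_latency_ms() -> float:
    # Toy model: requests normally take ~3-8 ms, but 2% of them land on a
    # node that's mid-GC-pause (or otherwise contended) and eat an extra 200 ms.
    base = random.uniform(3.0, 8.0)
    if random.random() < 0.02:
        base += 200.0
    return base

samples = sorted(request_latency_ms() for _ in range(100_000))
p50 = samples[50_000]
p99 = samples[99_000]
print(f"p50={p50:.1f}ms  p99={p99:.1f}ms")
```

The median user never sees the pause; the p99 user eats it almost every time.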

Some interesting things I also found.

Even though only a small fraction of requests experience these extreme latencies, they tend to affect your most profitable users. According to Roberto Vitillo, these users tend to be the ones making the highest number of requests, and thus have a higher chance of experiencing tail latencies. Which personally makes sense, because your most profitable users make far more requests than the average user.

https://robertovitillo.com/why-you-should-measure-tail-latencies/
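The same 1 − (1 − p)^N formula explains this too, just applied per user instead of per call chain. The request counts below are made up for illustration:

```python
def sees_a_tail(requests: int, p_tail: float = 0.01) -> float:
    # Probability a user sees at least one tail response, assuming each
    # request independently has a p_tail (here p99, i.e. 1%) chance of being slow.
    return 1 - (1 - p_tail) ** requests

print(f"casual user, 10 requests/day: {sees_a_tail(10):.1%}")
print(f"power user, 1000 requests/day: {sees_a_tail(1000):.1%}")
```

The power user is essentially guaranteed to hit the tail every single day.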

But yes, this is well known: Amazon’s oft-cited finding is that every 100ms of latency costs them 1% in sales, and Netflix tracks p99.9 latency because even a tiny percentage means thousands of users not being able to access the site.

The joke

As I mentioned at the start, “Kafka’s tail latency sucks.” However, after reading (skimming) dozens of papers and random blog posts, this really isn’t a Kafka or streaming problem. Rather, it’s a fundamental property of distributed systems at scale (The Tail at Scale), where “temporary high latency episodes which are unimportant in moderate size systems may come to dominate overall service performance at large scale.”

TLDR: Everyone’s tail latency sucks.

P.S. Please do read the papers I linked; they are much more worth reading and describe this more succinctly than this blog. Also, if I made some mistakes or incorrect assumptions, please do tell me. I can be contacted at leungke@oregonstate.edu