
linkerd viz stat-outbound reports incorrect latencies #13483

Open
kflynn opened this issue Dec 13, 2024 · 2 comments
kflynn commented Dec 13, 2024

What is the issue?

linkerd viz stat-outbound seems to report latencies that are many times larger than actual latencies.

How can it be reproduced?

Get a new cluster, then

linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd viz install | kubectl apply -f -
linkerd check

kubectl create ns faces
kubectl annotate ns/faces linkerd.io/inject=enabled

helm install faces -n faces \
     oci://ghcr.io/buoyantio/faces-chart --version 2.0.0-rc.2 \
     --set gui.serviceType=LoadBalancer \
     --set face.errorFraction=0 \
     --set backend.errorFraction=0 \
     --set backend.delayBuckets="1000"

kubectl rollout status -n faces deploy

At this point, we need access to the faces-gui Service in the faces namespace. Though I can access it directly via its external IP in my setup, I'll write this repro using kubectl port-forward.

kubectl port-forward -n faces svc/faces-gui 8080:80

Using --set backend.delayBuckets=1000 when installing Faces means that the backend workloads (smiley and color) will always delay every call 1000ms. We can verify this with curl:

curl 'http://localhost:8080/face/center/?row=2&col=2' | jq

This will take about a second, and its output will include a latency element that will be around 1000ms. We can check this a few times:

for i in 1 2 3 4 5 6 7 8 9 10; do
    curl -s 'http://localhost:8080/face/center/?row=2&col=2' | jq .latency
done

This will take about 10 seconds, and should show a stack of numbers around 1000.

Go ahead and open a web browser to http://localhost:8080 and you'll see the Faces GUI, which we'll use as a traffic generator. Then run

watch linkerd viz stat-outbound -n faces deploy/face

and you'll see something like this (after possibly giving it a chance to warm up):

NAME  SERVICE    ROUTE      TYPE  BACKEND    SUCCESS    RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
face  color:80   [default]                   100.00%   8.00       5500ms       9550ms       9910ms     0.00%    0.00%
                 └─────────────►  color:80   100.00%   8.00       5500ms       9550ms       9910ms     0.00%
face  smiley:80  [default]                   100.00%   8.00       5500ms       9550ms       9910ms     0.00%    0.00%
                 └─────────────►  smiley:80  100.00%   8.00       5500ms       9550ms       9910ms     0.00%

Ignore the 100% success rate for color (it's a gRPC service and we have no GRPCRoutes at present) and look at the latencies. All of these numbers should be right at 1000ms, but they're not.

Even more interesting, if we switch the backend latencies to 100ms:

kubectl set env -n faces deploy/smiley DELAY_BUCKETS=100
kubectl set env -n faces deploy/color DELAY_BUCKETS=100
kubectl rollout status -n faces deploy

then the for loop above will run in about one second and show numbers right around 100, but linkerd viz stat-outbound -n faces deploy/face will (after it settles down) show things like this:

NAME  SERVICE    ROUTE      TYPE  BACKEND    SUCCESS    RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES
face  smiley:80  [default]                   100.00%   8.00        275ms        478ms        496ms     0.00%    0.00%
                 └─────────────►  smiley:80  100.00%   8.00        175ms        242ms        248ms     0.00%
face  color:80   [default]                   100.00%   8.00        275ms        478ms        496ms     0.00%    0.00%
                 └─────────────►  color:80   100.00%   8.00        175ms        242ms        248ms     0.00%

which is even weirder -- why the distinction between the different rows?

Logs, error output, etc

See above.

output of linkerd check -o short

:; linkerd check -o short
Status check results are √

Environment

macOS 15.1.1
k3d version 5.7.4
k3s version 1.30.4-k3s1
Linkerd edge-24.11.8
Faces 2.0.0-rc.2 😇

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

kflynn added the bug label Dec 13, 2024
adleong commented Dec 13, 2024

This is an artifact of the latency histogram buckets that Linkerd uses for these metrics.

https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/app/outbound/src/http/logical/policy/route/metrics.rs#L104

In this case, the relevant latency histogram bucket bounds are [1s, 10s]. When the Linkerd proxy observes a latency that is slightly larger than 1s, it records it into this bucket, where it becomes indistinguishable from any other value between 1s and 10s. Then, when performing latency quantile calculations, we get a value like 5.5s for p50 (which, as you point out, is much larger than the actual latency value).

These types of histogram artifacts can always occur when recorded latency values are near a bucket boundary, and the magnitude of the error depends on the size of the buckets. If Linkerd used finer-grained buckets here, the magnitude of the error would be smaller, but that would come at a cost in memory and timeseries storage.
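As a concrete check, here's a minimal Python sketch (mine, not Linkerd code) of the per-bucket linear interpolation that Prometheus-style quantile estimation performs. Assuming every ~1001ms observation lands in the single [1s, 10s] bucket, it reproduces the percentiles in the stat-outbound table above:

```python
# Minimal sketch of per-bucket linear interpolation for quantile estimation.
# Assumption: every observed latency falls in the single bucket [1s, 10s],
# so the estimator can only interpolate between the bucket bounds.

def bucket_quantile(q, lower_ms, upper_ms):
    """Estimate the q-th quantile when all samples sit in one bucket."""
    return lower_ms + q * (upper_ms - lower_ms)

for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}: {bucket_quantile(q, 1000, 10000):.0f}ms")
# p50: 5500ms, p95: 9550ms, p99: 9910ms -- matching the table above
```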

Additionally, this test exhibited "worst case" behavior because the artificially chosen latency of 1s happens to sit right on a bucket boundary. Real-world latencies are less likely to be clustered directly on bucket boundaries. Repeating this experiment with backend.delayBuckets="913" gives more accurate (i.e. less inaccurate) percentiles:
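For what it's worth, the same interpolation sketch reproduces the numbers below too, if we assume the ~913ms observations all land in a [500ms, 1s] bucket (my inference from the reported percentiles, not something taken from the proxy source):

```python
# Same per-bucket linear interpolation, for the delayBuckets=913 repro.
# Assumption: the ~913ms latencies all fall in a [500ms, 1s] bucket
# (inferred from the reported percentiles, not from the proxy source).

def bucket_quantile(q, lower_ms, upper_ms):
    return lower_ms + q * (upper_ms - lower_ms)

for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}: {bucket_quantile(q, 500, 1000):.0f}ms")
# p50: 750ms, p95: 975ms, p99: 995ms
```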

linkerd viz stat-outbound -n faces deploy/face 
NAME  SERVICE    ROUTE      TYPE  BACKEND    SUCCESS   RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99  TIMEOUTS  RETRIES  
face  color:80   [default]                   100.00%  2.37        750ms        975ms        995ms     0.00%    0.00%  
                 └─────────────►  color:80   100.00%  2.37        750ms        975ms        995ms     0.00%           
face  smiley:80  [default]                   100.00%  2.37        750ms        975ms        995ms     0.00%    0.00%  
                 └─────────────►  smiley:80  100.00%  2.37        750ms        975ms        995ms     0.00% 

adleong commented Dec 13, 2024

For context on our bucket size choices, see linkerd/linkerd2-proxy#3086
