babel RTT metric false samples
Stephanie Wilde-Hobbs
bird-users at stephanie.is
Thu Apr 11 18:58:23 CEST 2024
Hi,
The babel RTT metric measurements provided by bird appear suspect for
my setup. The metric through a tunnel with a latency of about 5ms is
shown in babel as 150+ms.
Can others replicate this issue? It should be easy for other babel
users to check, since RTT measurement is on by default in recent
versions.
At first I suspected a problem with the tunnel, so I compared bird's
babel RTT measurement against a long-running ping over the same time
period. Bird's babel implementation measured ~160ms, while pings through
the same wireguard tunnel reported 4.6ms, with a maximum latency of 28ms.
Other machines across my network also report similarly inflated RTT
metrics for all non-wired links.
Debug logs show many RTT samples with approximately correct values
(4-6ms), then the occasional IHU for which 800-1200ms is calculated
instead.
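
To illustrate how a few such outliers can dominate the reported metric,
here is a minimal sketch that runs an exponentially weighted moving
average over a made-up sample stream. The smoothing weight (alpha) and
the outlier frequency (one sample in seven) are assumptions chosen for
illustration only; they are not taken from bird's source or from the
actual logs.

```python
def smooth_rtt(samples_ms, alpha=0.836):
    """Exponentially smooth RTT samples.

    alpha is an assumed, illustrative weight for the previous estimate;
    it is not claimed to be the value bird uses.
    """
    smoothed = None
    history = []
    for sample in samples_ms:
        if smoothed is None:
            # The first sample seeds the estimate.
            smoothed = float(sample)
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * sample
        history.append(smoothed)
    return history


# Hypothetical stream: six plausible 5 ms samples, then one 1000 ms outlier.
samples = ([5.0] * 6 + [1000.0]) * 20
history = smooth_rtt(samples)
mean_smoothed = sum(history) / len(history)
print(f"good samples: 5.0 ms, mean smoothed RTT: {mean_smoothed:.1f} ms")
```

With these assumed inputs the smoothed value stays far above the 4-6ms
of the good samples, which is the same kind of inflation described
above.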
Calculating the RTT metric by hand from the timestamps in the babel
packet logs gives the same values, so the arithmetic itself is correct.
By correlating two packet dumps (the machines' clocks are synchronized
by NTP to within 1ms) I can also see that the packets for which a high
RTT is calculated have similar transit times through the tunnel as
other packets. Hence, I suspect the accuracy of the packet timestamps
recorded by bird. Does the current packet timestamping system give
correct timestamps if a packet arrives while babel is processing
another event?
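
For anyone who wants to repeat the by-hand check, here is a minimal
sketch of the calculation, assuming the timestamp scheme of the Babel
RTT extension: 32-bit microsecond timestamps, with the hold time at the
neighbour subtracted from the time elapsed locally. The function name
and the example numbers are hypothetical. The second call shows how a
receive timestamp that is recorded late (for instance because the
daemon was busy) would inflate the result by exactly that delay.

```python
MASK32 = 0xFFFFFFFF  # timestamps are assumed to wrap at 32 bits


def rtt_us(origin, receive, transmit, local_rx):
    """Return the RTT in microseconds for one Hello/IHU exchange.

    origin   : our timestamp in the Hello, echoed back by the neighbour
    receive  : neighbour's clock when it received our Hello
    transmit : neighbour's clock when it sent the IHU
    local_rx : our clock when the IHU was timestamped on arrival
    """
    local_elapsed = (local_rx - origin) & MASK32
    remote_hold = (transmit - receive) & MASK32
    return local_elapsed - remote_hold


# Hypothetical exchange: 2.5 ms each way, IHU held for 300 ms remotely.
# The neighbour's clock has an arbitrary offset; only differences matter.
origin = 1_000_000
receive = 50_000_000
transmit = receive + 300_000
true_arrival = origin + 2_500 + 300_000 + 2_500

accurate = rtt_us(origin, receive, transmit, true_arrival)
# Same packet, but the arrival is stamped 1 s after it really arrived.
late = rtt_us(origin, receive, transmit, true_arrival + 1_000_000)

print(accurate / 1000, "ms with an accurate receive timestamp")
print(late / 1000, "ms with a receive timestamp recorded 1 s late")
```

Under this scheme a late receive timestamp on either side increases the
computed RTT, while the packet's real transit time is unchanged, which
would match what the two packet dumps show.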
I can provide packet captures for anyone interested in debugging further.
Thanks,
Stephanie.