babel RTT metric false samples
Maria Matejka
maria.matejka at nic.cz
Sat Apr 13 16:14:01 CEST 2024
Hello Stephanie, Toke and list,
On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via
Bird-users wrote:
> Stephanie Wilde-Hobbs via Bird-users <bird-users at network.cz> writes:
> > The babel RTT metric measurements provided by bird appear suspect
> > for my setup. The metric through a tunnel with a latency of about
> > 5ms is shown in babel as 150+ms.
[...]
> > Debug logs show many RTT samples with approximately correct
> > timestamps (4-6ms) then the occasional IHU with 800-1200ms
> > calculated instead. Calculating the RTT metric by hand using babel
> > packet logs shows that the calculations are correct. By correlating
> > two packet dumps (the machines have <1ms NTP timekeeping) I can also
> > see that the packets for which high RTT is calculated have similar
> > transit times through the tunnel as other packets. Hence, I suspect
> > the accuracy of the packet timestamps recorded by bird. Is the
> > current packet timestamping system giving correct timestamps if the
> > packet arrives while babel is processing another event?
>
> Hmm, so the Babel implementation in Bird tries to get a timestamp as
> early as possible after receiving the packet, and to set it as late as
> possible before sending out the packet. However, the former in practice
> means after returning from poll(), so if the packet has been sitting
> around in the OS buffer for a while before Bird gets around to processing it,
> the timestamp is not set until Bird is done processing it. Likewise,
> if the packet sits around in a socket buffer (or in a lower-level
> buffer on the sending side) after Bird has sent it out, that time will
> also be counted as part of the RTT.
I suspect that the kernel table prune routine may be the cause. It
runs synchronously from beginning to end, so nothing else gets
processed (or timestamped) until it finishes.
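
To illustrate why a stall before timestamping shows up directly in the
sample, here is a rough sketch of the RTT arithmetic. It assumes the
formula from the Babel RTT extension, RTT = (t2 - t1) - (t2' - t1'), and
it is not BIRD's actual code. The function name, the clock offset and
all the numbers are made up for the example.

```python
def babel_rtt(t1, t1_rx, t2_tx, t2):
    """RTT sample in the same unit as the timestamps (here microseconds).

    t1     local clock when we sent the Hello
    t1_rx  neighbour's clock when it timestamped our Hello on receipt
    t2_tx  neighbour's clock when it sent the IHU
    t2     local clock when we timestamped the IHU on receipt
    """
    return (t2 - t1) - (t2_tx - t1_rx)


# Made-up scenario: 2.5 ms one-way delay, neighbour replies 100 ms later,
# neighbour's clock is 5 s ahead of ours (the offset cancels out).
ONE_WAY_US = 2_500
HOLD_US = 100_000
OFFSET_US = 5_000_000

t1 = 1_000_000
t1_rx = t1 + ONE_WAY_US + OFFSET_US
t2_tx = t1_rx + HOLD_US
arrival = t1 + ONE_WAY_US + HOLD_US + ONE_WAY_US  # local clock at IHU arrival

# Timestamp taken right at arrival: the sample is the true round trip.
clean = babel_rtt(t1, t1_rx, t2_tx, arrival)

# Timestamp taken 900 ms late because the process was busy elsewhere:
# the whole stall is added to the sample.
STALL_US = 900_000
stalled = babel_rtt(t1, t1_rx, t2_tx, arrival + STALL_US)

print(clean, stalled)
```

The same thing happens if the stall is on the neighbour's side while it
timestamps the Hello: t1' comes out too late, the subtracted hold time
shrinks, and the sample grows by the length of the stall.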
I have just fast-tracked running Babel in its own thread for BIRD 3; it
may be worth checking. (Build artifacts should also be available for
download.) This should get rid of most cases of suspiciously high RTT.
https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads
Note that updating a route in BIRD 3 still involves locking, so it may
still distort the RTT measurements. At least that should happen only
when Babel itself is doing the update. Anyway, with BIRD 3 internals, it
should be easy to *detect* such situations and disregard the affected
measurements as unreliable. (Not implemented, though.)
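
Such a check could look roughly like the sketch below. This is purely
hypothetical and not how BIRD does (or will do) it: it assumes each RTT
sample is recorded together with the time the thread was blocked before
the packet got its timestamp, and it drops samples where that time is
above a threshold. The names RttSample, blocked_us and MAX_BLOCKED_US
are invented for the example, as are the numbers.

```python
from dataclasses import dataclass

# Hypothetical threshold: a sample is distrusted if the thread was
# blocked for longer than this before the packet was timestamped.
MAX_BLOCKED_US = 10_000


@dataclass
class RttSample:
    rtt_us: int      # computed RTT sample, in microseconds
    blocked_us: int  # how long the thread was blocked before timestamping


def reliable_samples(samples, max_blocked_us=MAX_BLOCKED_US):
    """Keep only the samples taken while the thread was responsive."""
    return [s for s in samples if s.blocked_us <= max_blocked_us]


# Made-up data: three plausible samples and one taken after a long stall.
samples = [
    RttSample(rtt_us=4_800, blocked_us=120),
    RttSample(rtt_us=5_300, blocked_us=90),
    RttSample(rtt_us=905_000, blocked_us=900_000),
    RttSample(rtt_us=5_100, blocked_us=300),
]

kept = reliable_samples(samples)
print([s.rtt_us for s in kept])
```

Filtering on the blocked time rather than on the RTT value itself means
a genuinely slow link is still measured; only the samples known to be
polluted locally get dropped.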
There are also some thoughts on implementing lockless import queues for
routing tables, but for now we have to prioritize stabilizing BIRD 3 so
that we can actually release it as a stable version. Import queues must
wait.
Also, while testing this, feel free to report any weird behavior,
notably crashes of BIRD 3, as bugs. That would help a lot with
stabilizing BIRD 3. Thanks a lot!
Maria
--
Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.