babel RTT metric false samples
Stephanie Wilde-Hobbs
bird-users at stephanie.is
Wed Aug 14 17:55:47 CEST 2024
Hi,
I've been running that branch since Jul 24. I've had no issues and it has
run stably, apart from a logged assertion failure:
Assertion 'cs == CS_DOWN || cs == CS_START' failed at nest/proto.c:1147
Unfortunately I can't speak for its effect on Babel RTT since I have
moved only a single machine to that branch, to avoid depending on babel
3 in the firing line of my home internet.
Hope this helps validate the threaded babel implementation.
Stephanie.
On 13/04/2024 16:14, Maria Matejka wrote:
> Hello Stephanie, Toke and list,
>
> On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via
> Bird-users wrote:
>
> Stephanie Wilde-Hobbs via Bird-users bird-users at network.cz writes:
>
> The babel RTT metric measurements provided by bird appear
> suspect for my setup. The metric through a tunnel with a latency
> of about 5ms is shown in babel as 150+ms.
>
> […]
>
> Debug logs show many RTT samples with approximately correct
> timestamps (4-6ms) then the occasional IHU with 800-1200ms
> calculated instead. Calculating the RTT metric by hand using
> babel packet logs shows that the calculations are correct. By
> correlating two packet dumps (the machines have <1ms NTP
> timekeeping) I can also see that the packets for which high RTT
> is calculated have similar transit times through the tunnel as
> other packets. Hence, I suspect the accuracy of the packet
> timestamps recorded by bird. Is the current packet timestamping
> system giving correct timestamps if the packet arrives while
> babel is processing another event?
>
> Hmm, so Babel implementation in Bird tries to get a timestamp as
> early as possible after receiving the packet, and set it as late as
> possible before sending out the packet. However, the former in
> practice means after returning from poll(), so if the packet has
> been sitting around in the OS buffer for a while before Bird gets
> around to process it, the timestamp is not set until Bird is done
> processing it. Likewise, if the packet sits around in a socket
> buffer (or in a lower-level buffer on the sending side) after Bird
> has sent it out, that time will also be counted as part of the RTT.
>
> I would suspect that the kernel table prune routine may be the cause. It
> just runs from begin to end synchronously.
>
> I have just fast-tracked Babel in its own thread for BIRD 3, it may be
> worth checking. (Build artifacts should also be available for
> download.) This should rid you of most of the cases of
> suspiciously high RTT.
>
> https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads
>
> Just to note, updating a route in BIRD 3 is still a locking process
> so it may still tamper with the RTT measurements. At least it should happen
> only in cases where Babel is doing the update. Anyway, with BIRD 3
> internals, it should be possible to easily /detect/ such situations and
> disregard these single measurements as unreliable. (Not implemented,
> though.)
>
> There are even some thoughts on implementing lockless import queues for
> routing tables, yet now we have to prioritize BIRD 3 stabilization to
> actually release it as a stable version. Import queues must wait.
>
> Also with this testing, feel free to report any weird behavior, notably
> crashes of BIRD 3, as bugs. That would be very helpful with stabilizing
> BIRD 3. Thanks a lot!
>
> Maria
>
> – Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
>
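
To make the effect discussed in the quoted thread concrete, here is a minimal Python sketch of a Babel RTT sample as I understand the timestamp exchange: the time elapsed locally between sending a Hello and timestamping the returning IHU, minus the time the neighbour held the packet. It is not BIRD's code. The function names, the `stall_us` input and the 10 ms threshold are all hypothetical, timestamps are plain integer microseconds, and counter wraparound is ignored. The second function only illustrates the "detect and disregard" idea mentioned above, which the thread says is not implemented.

```python
def babel_rtt_us(origin_ts, receive_ts, ihu_tx_ts, local_rx_ts):
    """Return one RTT sample in microseconds (hypothetical helper).

    origin_ts   : our clock, when we sent the Hello
    receive_ts  : neighbour's clock, when it received that Hello
    ihu_tx_ts   : neighbour's clock, when it sent the IHU back
    local_rx_ts : our clock, when we timestamped the incoming IHU
    """
    elapsed_local = local_rx_ts - origin_ts
    held_by_neighbour = ihu_tx_ts - receive_ts
    # Clamp at zero so clock jitter cannot yield a negative RTT.
    return max(0, elapsed_local - held_by_neighbour)


def is_reliable(stall_us, max_stall_us=10_000):
    """Assumed policy: drop a sample if the daemon was busy for longer
    than max_stall_us before it read the packet. How stall_us would be
    measured is left open here."""
    return stall_us <= max_stall_us


# Tunnel with 2.5 ms one-way latency; the neighbour holds the packet for 1 s.
on_time = babel_rtt_us(0, 2_500, 1_002_500, 1_005_000)

# Same packet, but it waits 900 ms in the OS buffer before being timestamped.
delayed = babel_rtt_us(0, 2_500, 1_002_500, 1_905_000)

print(on_time, delayed)
```

In this example the packet spends the same time on the wire in both cases, yet the second sample comes out 900 ms higher because the receive timestamp was taken late. That matches the pattern described in the original report, where mostly correct samples are mixed with occasional very large ones.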