babel RTT metric false samples

Wed Aug 14 17:55:47 CEST 2024

Hi,

I've been running that branch since Jul 24. I've had no issues and it has 
run stably apart from a logged assertion failure:

Assertion 'cs == CS_DOWN || cs == CS_START' failed at nest/proto.c:1147

Unfortunately I can't speak for its effect on Babel RTT since I have 
moved only a single machine to that branch, to avoid depending on BIRD 
3 in the firing line of my home internet.

Hope this helps validate the threaded babel implementation.

Stephanie.

On 13/04/2024 16:14, Maria Matejka wrote:
> Hello Stephanie, Toke and list,
> 
> On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via 
> Bird-users wrote:
> 
>     Stephanie Wilde-Hobbs via Bird-users <bird-users at network.cz>
>     writes:
> 
>         The babel RTT metric measurements provided by bird appear
>         suspect for my setup. The metric through a tunnel with a latency
>         of about 5ms is shown in babel as 150+ms.
> 
> […]
> 
>         Debug logs show many RTT samples with approximately correct
>         timestamps (4-6ms) then the occasional IHU with 800-1200ms
>         calculated instead. Calculating the RTT metric by hand using
>         babel packet logs shows that the calculations are correct. By
>         correlating two packet dumps (the machines have <1ms NTP
>         timekeeping) I can also see that the packets for which high RTT
>         is calculated have similar transit times through the tunnel as
>         other packets. Hence, I suspect the accuracy of the packet
>         timestamps recorded by bird. Is the current packet timestamping
>         system giving correct timestamps if the packet arrives while
>         babel is processing another event?
> 
>     Hmm, so the Babel implementation in Bird tries to get a timestamp as
>     early as possible after receiving the packet, and to set it as late
>     as possible before sending out the packet. However, the former in
>     practice means after returning from poll(), so if the packet has
>     been sitting around in the OS buffer for a while before Bird gets
>     around to processing it, the timestamp is not set until Bird is done
>     processing it. Likewise, if the packet sits around in a socket
>     buffer (or in a lower-level buffer on the sending side) after Bird
>     has sent it out, that time will also be counted as part of the RTT.
> 
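
To illustrate the point above, here is a minimal sketch of how a late local receive timestamp inflates one RTT sample. It assumes the usual Babel RTT calculation, RTT = (t2 - t1) - (t2' - t1'), with all timestamps in microseconds. The function name and all numbers are hypothetical and are not taken from BIRD's source.

```python
def rtt_sample_us(t1, t1_remote, t2_remote, t2):
    """One RTT sample in microseconds.

    t1        -- local time when our Hello was sent
    t1_remote -- neighbour's clock when it received that Hello
    t2_remote -- neighbour's clock when it sent the IHU back
    t2        -- local time when we timestamped the IHU
    The neighbour's clock offset cancels out, so no clock sync is needed.
    """
    return (t2 - t1) - (t2_remote - t1_remote)


# Hypothetical scenario: 2.5 ms each way, neighbour holds the timestamp for 4 s.
one_way_us = 2_500
remote_hold_us = 4_000_000

t1 = 1_000_000
t1_remote = 50_000_000                      # unrelated remote clock
t2_remote = t1_remote + remote_hold_us
t2_on_wire = t1 + one_way_us + remote_hold_us + one_way_us

# Timestamp taken as soon as the packet arrives.
clean_us = rtt_sample_us(t1, t1_remote, t2_remote, t2_on_wire)

# Timestamp taken 900 ms late because the daemon was busy elsewhere
# (hypothetical stall); the whole delay ends up in the sample.
stall_us = 900_000
late_us = rtt_sample_us(t1, t1_remote, t2_remote, t2_on_wire + stall_us)

print(clean_us, late_us)
```

With these made-up numbers the clean sample is about 5 ms and the late one about 905 ms, which is the same shape as the 4-6 ms versus 800-1200 ms samples described in the original report.
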
> I would suspect that the kernel table prune routine may be the cause. It 
> just runs from beginning to end synchronously.
> 
> I have just fast-tracked Babel in its own thread for BIRD 3; it may be 
> worth checking. (Build artifacts should also be available for 
> download.) This should rid you of most of the cases of suspiciously 
> high RTT.
> 
> https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads
> 
> Note that updating a route in BIRD 3 is still a locking process, so it 
> may still distort the RTT measurements. At least it should happen only 
> in cases where Babel is doing the update. Anyway, with BIRD 3 
> internals, it should be possible to easily /detect/ such situations and 
> disregard these single measurements as unreliable. (Not implemented, 
> though.)
> 
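
A rough sketch of what "detect and disregard" could look like, purely as an illustration of the idea. This is not BIRD code and nothing like it is implemented. The `stall_us` field (how long the loop was blocked before it read the packet) and the 10 ms threshold are assumptions made up for this example.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    rtt_us: int    # computed RTT sample, microseconds
    stall_us: int  # hypothetical: how long the loop was blocked before reading


def reliable_rtts(samples: List[Sample], max_stall_us: int = 10_000) -> List[int]:
    """Drop samples whose receive timestamp was taken after a long stall."""
    return [s.rtt_us for s in samples if s.stall_us <= max_stall_us]


# Hypothetical samples: three prompt ones and one taken after a 900 ms stall.
samples = [
    Sample(rtt_us=5_000, stall_us=200),
    Sample(rtt_us=4_800, stall_us=150),
    Sample(rtt_us=905_000, stall_us=900_000),
    Sample(rtt_us=5_300, stall_us=300),
]

kept = reliable_rtts(samples)
print(kept)
```

The filter keys on the measured stall and not on the RTT value itself, so a genuinely slow path would still be reported; only samples that the daemon knows it timestamped late are dropped.
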
> There are even some thoughts on implementing lockless import queues for 
> routing tables, but for now we have to prioritize BIRD 3 stabilization 
> to actually release it as a stable version. Import queues must wait.
> 
> Also with this testing, feel free to report any weird behavior, notably 
> crashes of BIRD 3, as bugs. That would be very helpful with stabilizing 
> BIRD 3. Thanks a lot!
> 
> Maria
> 
> – Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
>