BIRD 3.0-alpha0

Maria Matejka maria.matejka at nic.cz
Fri Mar 4 10:39:22 CET 2022


Hello!

>>> There is also a blogpost about performance comparison we have recently
>>> done between 3.0-alpha0 and 2.0.8 (the previous single-threaded
>>> version) and you can read there also other articles about the
>>> multi-threaded version.
>>
>> … and here is the blogpost.
>>
>> https://en.blog.nic.cz/2022/02/21/bird-journey-to-threads-chapter-3%c2%bd-route-server-performance/ 
> 
> Thanks for the clear writeup. I like to read about this progress (I 
> really should read the dedicated blog about the multithreaded stuff). I 
> can't wait for it to be stable and have a stable 3.0 release. I know 
> this is way off still, but a guy can dream, right? :).

To be honest, writing these blogposts is a really good way to do
several things at once:

* design documentation
* PR
* rethinking once more whether the implementation is correct
* fixing an awful lot of bugs due to repeated code inspection
* learning even better English

In fact, the stable 3.0 release is not that far away. Most of the
work is already done. The reason is simple: BIRD 1.x and 2.x are
super-optimized to run in a single-threaded environment. This has kept
BIRD somewhat competitive even when others came with SMP. The drawback
of these optimizations is that I had to untangle them all at once before
even attempting to run anything in its own thread.

Since the release, we have already discovered two major bugs. Both of
them show pretty well how everything in BIRD is internally entangled
with everything else.

1. ROA check in filters called during the "show route" CLI command.
Thanks to DE-CIX for reporting this.

There is a major internal deadlock-prevention principle: Locks have a
partial ordering. A thread may acquire a lock only if all the locks it
currently holds are strictly greater than the wanted one.

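To do a ROA check, you have to lock the ROA table. To do "show route",
you have to lock the table holding these routes. All table locks are
mutually incomparable, so a ROA check called from "show route" asks for
a second table lock while the first one is still held, and this call
will simply fail.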
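
To illustrate the ordering rule and why the ROA check fails under "show
route", here is a minimal sketch in Python. It is not BIRD code: the
level names in LEVELS, the Lock and ThreadLocks classes and the table
names master4 and roa4 are all made up for this example. It only assumes
what is said above: a thread may take a lock only if every lock it holds
is strictly greater, and two table locks are incomparable.

```python
from dataclasses import dataclass

# Hypothetical lock levels, for illustration only; these are not BIRD's
# actual lock domains. A bigger number means a "greater" lock.
LEVELS = {"global": 3, "table": 2, "attributes": 1}


@dataclass(frozen=True)
class Lock:
    name: str
    level: str


def strictly_greater(a, b):
    """Partial order: locks on the same level are incomparable."""
    return LEVELS[a.level] > LEVELS[b.level]


class LockOrderError(RuntimeError):
    pass


class ThreadLocks:
    """Tracks the locks held by one thread and enforces the ordering rule."""

    def __init__(self):
        self.held = []

    def acquire(self, wanted):
        # Every lock already held must be strictly greater than the wanted one.
        for lock in self.held:
            if not strictly_greater(lock, wanted):
                raise LockOrderError(
                    f"cannot lock {wanted.name} while holding {lock.name}"
                )
        self.held.append(wanted)

    def release(self, lock):
        self.held.remove(lock)


master4 = Lock("master4", "table")  # table walked by "show route"
roa4 = Lock("roa4", "table")        # table consulted by the ROA check

# Import/export path: no table lock is held, so the filter may lock ROA.
worker = ThreadLocks()
worker.acquire(roa4)
worker.release(roa4)

# CLI path: "show route" holds its table lock, then the filter wants ROA.
cli = ThreadLocks()
cli.acquire(master4)
try:
    cli.acquire(roa4)
    cli_result = "locked"
except LockOrderError as err:
    cli_result = str(err)
```

In this sketch the worker succeeds because it holds nothing when it asks
for the ROA table, while the CLI path is refused because neither table
lock is greater than the other.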

When regular import and export are done, no table is locked and the
filter may look into any other table freely. (The export case took a
serious amount of rework.) I mistakenly thought that the CLI could be
left as-is for now, so it simply walks over the table while holding the
table lock.

This bug can't be fixed easily without a complete rewrite of
"show route". That rewrite is already planned, so it is no big problem;
I'm using this bug to illustrate the internal entanglement.

2. MRT dumping. Found internally while discussing the BMP merge.

I forgot to think about this at all. Massive MRT dumping may lead to 
race conditions and undebuggable corefiles as there is no locking 
involved between MRT and BGP.

Locking can't be added to the code as-is because it would lead to a
locking priority inversion. Therefore MRT must also get some minor
rework (hopefully not as big as with "show route") before we can call
3.0 stable.

> Good luck with further optimizing the multithreaded code, this can't be 
> an easy task.

Thank you. The optimization itself is actually one of the easier tasks
in the overall rework. In fact, most of the work was like this:

1. this internal API is thread-unsafe, how to fix it
2. it violates any possible thread-safety principle, needs a rework
3. this unremarkable routine was abusing the old API, goto 2
4. finally the API is OK, let's rework everything around
5. this unremarkable routine causes locking priority inversion
6. this change depends on another change because of something
7. temporary branch and commit, let's do the other change first
8. goto 1

Thankfully, we're (probably) out of this dead loop, and the previously
mentioned bugs shouldn't bring us back into it.

Most of the actual optimization has been postponed and I now feel quite 
confident that the worst part (of this journey) is already behind us. 
I'm honestly looking forward to the optimization. It's a lot of fun, and
it's rewarding to see the results immediately after a change.

To make it crystal clear, 3.0 stable won't be much optimized in terms of
algorithms or data structures; this will happen later on. Anyway,
thank you for all your wishes.

Everyone, stay safe in these times.

Maria

