bird under heavy cpu load

Mon Mar 26 01:25:32 CEST 2012

On Mon, Mar 12, 2012 at 11:22:10PM +0400, Oleg wrote:
> On Mon, Mar 12, 2012 at 02:27:23PM +0400, Alexander V. Chernikov wrote:
> > On 12.03.2012 13:25, Oleg wrote:
> > >Hi, all.
> > >
> > >I have some experience with bird under heavy cpu load. I had a
> > >situation when bird do frequent updates of kernel table, because
> > >of bgp-session frequent down/up (because of cpu and/or net load).

Hello

Answering collectively for the whole thread:

I did some preliminary testing: on my test machine, exporting a full
BGP feed (cca 400k routes) to a kernel table took 1-2 sec on Linux and
5-6 sec on BSD, with a similar time for flushing the kernel table.
Therefore, if we devote half a CPU to kernel sync, we have about 200 kr/s
(kiloroutes per second) on Linux and 40 kr/s on BSD, taking the faster
end of each range. This still seems more than enough for an edge router.
Are there any estimates (using protocol statistics) of the number of
updates to the kernel protocol in this case? How many protocols, tables
and pipes do you have in your case?
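
As a sanity check on those figures, here is a small sketch of the
arithmetic. The helper name sync_rate_krs() is made up for illustration
and is not part of BIRD; it only scales a measured bulk export time by
the CPU share given to kernel sync.

```c
#include <assert.h>

/* Hypothetical helper (not a BIRD function): sustained kernel sync rate in
 * kiloroutes per second, given a measured bulk export of `routes` routes in
 * `seconds` seconds and the fraction of one CPU devoted to syncing. */
double sync_rate_krs(double routes, double seconds, double cpu_share)
{
    assert(seconds > 0.0 && cpu_share >= 0.0 && cpu_share <= 1.0);
    return routes / seconds * cpu_share / 1000.0;
}
```

With 400k routes and half a CPU, sync_rate_krs(400000, 1, 0.5) gives 200
and sync_rate_krs(400000, 5, 0.5) gives 40. The slower ends of the
measured ranges (2 sec and 6 sec) give 100 and about 33 kr/s.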

The key to responsiveness (and the ability to send keepalives on time)
during heavy CPU load is granularity. The main problem in BIRD is
that the whole route propagation is done synchronously - when a route is
received, it is propagated through all pipes and all routing tables to
all final receivers in one step, which is problematic if you have
several hundred BGP sessions (but probably not too problematic with
regard to kernel sync). If this could be split and done
asynchronously in some smart way, it could solve several problems. One
idea is that routes (or nets) in a table would be connected in a list in
the order in which they arrived in the table, and each protocol would
have a position in that list marking the routes not yet exported to that
protocol, plus an event in the event queue handling that export. Another
advantage of that approach is that we could temporarily stop or
rate-limit propagation of routes to that protocol.
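
A minimal sketch of that idea, under loud assumptions: all types and
function names below (route_entry, route_table, export_cursor,
table_append, export_step) are hypothetical and are not BIRD's real
structures. The sketch also ignores withdrawals and removal of entries
from the list. It only shows the per-protocol position and the bounded
export step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not BIRD's real structures. */
struct route_entry {
    int net_id;                      /* stand-in for a prefix */
    struct route_entry *next;        /* next entry in arrival order */
};

struct route_table {
    struct route_entry *head;        /* oldest entry */
    struct route_entry *tail;        /* newest entry */
};

/* Per-protocol position in the table's arrival-ordered list. */
struct export_cursor {
    const struct route_entry *last;  /* last entry exported, NULL if none yet */
    int paused;                      /* nonzero: propagation temporarily stopped */
    size_t exported;                 /* total entries exported so far */
};

/* Append a newly arrived entry; cursors need no update because they
 * remember the last exported entry rather than the next pending one. */
void table_append(struct route_table *t, struct route_entry *e)
{
    assert(t && e);
    e->next = NULL;
    if (t->tail)
        t->tail->next = e;
    else
        t->head = e;
    t->tail = e;
}

/* Export at most `budget` pending entries to one protocol and return how
 * many were exported. An event handler would call this once per run and
 * reschedule itself while entries remain, so no single step blocks long. */
size_t export_step(const struct route_table *t, struct export_cursor *c,
                   size_t budget, void (*emit)(const struct route_entry *))
{
    assert(t && c);
    if (c->paused)
        return 0;

    size_t done = 0;
    const struct route_entry *next = c->last ? c->last->next : t->head;
    while (next && done < budget) {
        if (emit)
            emit(next);              /* hand the route to the protocol */
        c->last = next;
        next = next->next;
        done++;
    }
    c->exported += done;
    return done;
}
```

The budget gives the rate limiting, and the paused flag gives the
temporary stop. A slow receiver then only delays its own cursor, not
the other protocols attached to the same table.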

But I noticed recently that there are some steps that have even bigger
latency than just plain route propagation. For example, disabling the
kernel protocol causes all routes in the kernel table to be flushed in one
step, which (as mentioned above) will block BIRD for cca 5 s on BSD with
400k routes. Fortunately, disabling the kernel protocol is not a common
operation.

Another problem is protocol flushing - if a protocol goes down, all its
routes are de-propagated in one step (function rt_prune()). This is
probably the most important cause of latencies and could be easily
fixed.
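
One way such a fix could look is to prune in bounded chunks and return
to the main loop between them. This is only a sketch under assumptions:
the flat array, the types flush_route and flush_state, and the function
prune_step() are invented for illustration. It is not how rt_prune()
is written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat route store for illustration; not BIRD's real table. */
struct flush_route {
    int proto_id;   /* protocol that announced this route */
    int live;       /* nonzero while the route is still in the table */
};

/* Progress of one flush, kept between steps. */
struct flush_state {
    int proto_id;   /* protocol whose routes are being removed */
    size_t pos;     /* next index to examine */
};

/* Examine at most `budget` entries, withdrawing those that belong to the
 * flushed protocol. Returns 1 if more entries remain, 0 when finished.
 * The budget bounds the entries examined, and so the time per step. */
int prune_step(struct flush_route *routes, size_t n,
               struct flush_state *st, size_t budget)
{
    assert(routes && st && st->pos <= n);

    size_t left = n - st->pos;
    size_t todo = budget < left ? budget : left;

    for (size_t end = st->pos + todo; st->pos < end; st->pos++) {
        struct flush_route *r = &routes[st->pos];
        if (r->live && r->proto_id == st->proto_id)
            r->live = 0;    /* de-propagate this one route */
    }
    return st->pos < n;
}
```

The caller would run prune_step() from an event and reschedule it while
it returns 1, so keepalives and other I/O can be handled between chunks.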

Another possible problem is the kernel scan, which is also done
in one step, but at least on Linux it could probably be split into
smaller steps, and it does not take too much time if the kernel table is
in the expected state.

I would probably implement some latency measurement and do some more
testing to get a better idea and probably fix the protocol flushing
problem.

BTW, one possible hack to save CPU time under heavy load with
regard to kernel syncing is to disable synchronous kernel updates in
krt_notify() and rely completely on syncing during the periodic scan.

BTW, regardless of all of this, for BFD we would definitely need a
separate process/thread, but with almost no shared data.

-- 
Elen sila lumenn' omentielvo

Ondrej 'SanTiago' Zajicek (email: santiago at crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."