Route Flapping and Bird - RFD and MRAI

Douglas Fischer fischerdouglas at gmail.com
Mon Feb 14 14:54:23 CET 2022


I'll take the liberty of describing the context that makes me see MRAI as
IMPORTANT for IXP Route-Servers, and of explaining why I keep bothering the
BIRD team about this.

Today I have the opportunity to be involved in operating the routers of
several ASNs connected to several IXPs and their Route-Servers.

On some occasions we have received complaints from various consulting
customers about abnormal behavior on certain routers: excessive CPU usage
and a prolonged time to align the RIB and the FIB after changes.

While diagnosing these cases, we found that the BGP process was struggling
with a backlog of UPDATE messages coming from the Route-Servers of the
large IXPs to which these routers were connected.

---

I'll try to be succinct in describing the context (this is a challenge for
me).
We scrutinized the BGP messages that caused the overload on these routers
and correlated them with networking news and with facts from the networks
we had some kind of contact with. What we observed was this:
When an event of great impact caused a variation in connectivity (direct or
indirect) in the networks of the participants of an IXP, the routers we
were troubleshooting received, almost at the same time, A HUGE QUANTITY of
BGP messages from the Route-Servers of the IXPs saying that "the ABCD/M
route is available for the XPTO participant" and, immediately after that,
"the ABCD/M route IS NO longer available for the XPTO participant".
And these notifications about the same routes involved almost half of the
Route-Servers' peers.
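
A minimal Python sketch of how this announce-then-withdraw pattern could be
counted in a captured stream of updates. Everything here is an assumption
for illustration: the tuple layout, the count_flaps name, the 2-second
window, and the sample data (documentation prefixes, made-up participant
names). It is not taken from any real capture.

```python
from collections import defaultdict


def count_flaps(updates, window_seconds):
    """Count announce -> withdraw reversals per prefix.

    updates: iterable of (timestamp, participant, prefix, action) tuples,
    sorted by timestamp, where action is "announce" or "withdraw".
    A reversal is counted when the same participant withdraws a prefix
    within window_seconds of announcing it.
    """
    last_seen = {}            # (participant, prefix) -> (timestamp, action)
    flaps = defaultdict(int)  # prefix -> number of quick reversals

    for ts, participant, prefix, action in updates:
        key = (participant, prefix)
        previous = last_seen.get(key)
        if previous is not None:
            prev_ts, prev_action = previous
            quick = ts - prev_ts <= window_seconds
            if prev_action == "announce" and action == "withdraw" and quick:
                flaps[prefix] += 1
        last_seen[key] = (ts, action)

    return dict(flaps)


# Hypothetical sample: two participants flap the same prefix.
sample_updates = [
    (0.0, "XPTO", "192.0.2.0/24", "announce"),
    (0.8, "XPTO", "192.0.2.0/24", "withdraw"),
    (1.0, "ACME", "192.0.2.0/24", "announce"),
    (1.5, "ACME", "192.0.2.0/24", "withdraw"),
    (2.0, "ACME", "198.51.100.0/24", "announce"),
]

print(count_flaps(sample_updates, window_seconds=2.0))
```

A prefix that shows up here with a count close to the number of
participants would match the behavior described above.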

---

Here is a sanitized example of such an event:
- Router of our consulting customer connected to two large IXPs (with more
than 1000 participants), one in Brazil and another in Europe (both with
their Route-Servers based on Bird).
- Some BIG event happens, such as the isolation of one of this IXP's
facilities or the failure of an important physical link in that
region/country.
- Just a subset of routes is actually affected (say 40-50 routes)
- About half of the thousands of networks connected to this IXP take some
time to stabilize this subset of routes on their backbones.
- During this time, all these participants receive notifications and resend
notifications about these changes in the status of these routes to the
Route-Servers of that IXP (which are BIRD-based).
- And within 2 seconds or less, these Route-Servers propagate changes that
have no practical effect on these routes but trigger further
re-notifications, in an almost vicious cycle.
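
The effect an MRAI-style timer could have on a sequence like the one above
can be sketched as follows. This is a simplified model in Python, not BIRD
code and not the exact timer behavior specified in RFC 4271: it batches
changes in fixed windows and sends only the net change per prefix at the
end of each window. The function name, the event tuples, and the 5-second
interval are assumptions for illustration.

```python
def coalesce_updates(events, mrai_seconds, initially_advertised=()):
    """Return the updates actually sent when changes are batched per window.

    events: iterable of (timestamp, prefix, state) tuples sorted by
    timestamp, where state is "announce" or "withdraw".
    initially_advertised: prefixes already announced before the events.
    """
    advertised = {prefix: "announce" for prefix in initially_advertised}
    pending = {}       # prefix -> latest state seen in the current window
    window_end = None  # time at which the current window is flushed
    sent = []          # (send_time, prefix, state) actually transmitted

    def flush(at):
        for prefix, state in pending.items():
            # Send only if the net result differs from what the peer has.
            if advertised.get(prefix, "withdraw") != state:
                sent.append((at, prefix, state))
                advertised[prefix] = state
        pending.clear()

    for ts, prefix, state in events:
        if window_end is not None and ts >= window_end:
            flush(window_end)
            window_end = None
        if window_end is None:
            window_end = ts + mrai_seconds
        pending[prefix] = state

    if window_end is not None:
        flush(window_end)
    return sent


# Hypothetical flap: one prefix goes down and up twice within 1.5 seconds.
flap = [
    (0.0, "192.0.2.0/24", "withdraw"),
    (0.5, "192.0.2.0/24", "announce"),
    (1.0, "192.0.2.0/24", "withdraw"),
    (1.5, "192.0.2.0/24", "announce"),
]

no_timer = coalesce_updates(flap, 0, initially_advertised=["192.0.2.0/24"])
with_timer = coalesce_updates(flap, 5, initially_advertised=["192.0.2.0/24"])
print(len(no_timer), len(with_timer))  # prints "4 0"
```

In this toy model the four changes cancel out inside a single 5-second
window, so nothing is sent downstream, while without the timer all four
are forwarded. The price is that a real change is delayed by up to one
interval.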

Some additional findings we were able to make:
- Routers in this situation that have more horsepower for BGP processing
(e.g. 16 threads, 16GB of RAM) also feel the impact of this avalanche of
notifications, but manage to get out of the overload much faster.
- Routers that are weaker at BGP processing (e.g. 4 threads, 4GB of RAM)
but have good traffic forwarding capacity (~1Tbps) suffer for SEVERAL
MINUTES in that same event until they return to a stable condition and
align the RIB and the FIB.
- Status changes and notifications for this same subset of routes also
arrive via BGP sessions with transit providers, but there the status
usually changes only 2 times.
- Addendum: with one specific transit provider we also observed this flood
of notifications about the same subset of routes. We confirmed with the
operators of that network that the transit provider's BGP engine is also
Bird 2.0.x.

...

Q.: Hey Douglas, where did you get the idea of relating this to BIRD?

Unfortunately, I was not able to do a detailed analysis that would prove
the relationship between BIRD and these floods of BGP updates.
However, considering:
 - The purpose for which MRAI was created, and the documentation from
some routing equipment vendors on the use of this feature.
 - Probable feedback loops between different networks connected to the same
route-server.
 - The possibility that these feedback loops amplify BGP notifications.
I believe this correlation is quite plausible.

I imagine we have colleagues here with more experience who can complement
or refute this theory. I'm open to hearing suggestions.

On Mon, Feb 14, 2022 at 10:48, Douglas Fischer <
fischerdouglas at gmail.com> wrote:

> Dropping by with another ping about MRAI (Minimum Route Advertisement
> Interval) in BIRD, and also about RFD (Route Flap Dampening).
>
> On Thu, Jul 8, 2021 at 17:18, Douglas Fischer <
> fischerdouglas at gmail.com> wrote:
>
>> Hello all.
>>
>> Over the last few weeks I have been noticing an increase in BGP messages
>> on some routers that I have access to,
>> especially those connected to IXPs with a considerable number of
>> participants.
>> I checked, and the CPU usage of the BGP process has also increased a bit.
>> So, I suspect that Route-Servers are receiving and forwarding Route Flaps
>> from one or more participants...
>>
>> I know that the right place to fix that would be on the source of those
>> flaps...
>> But... Considering IXPs with one or two thousand participants, there will
>> always be someone flapping some routes.
>>
>> Bird is widely used as the BGP engine on IXP route-servers.
>> And exactly because of that I came here to ask about Route Flap avoidance
>> mechanisms on Bird.
>>
>> The discussion of methods to avoid Route Flapping in BGP is very
>> controversial.
>>  - RFD - Route Flapping Dampening comes and goes.
>>  - There is no consensus on MRAI, but there is a lot of effort on it,
>> including some magic algorithms that define a variable interval according
>> to conditions.
>>  - I even saw some ultra-rigorous ideas like rate limiting the BGP
>> messages...
>>
>> Is there any work in progress to deal with this route-flapping question?
>>
>> Before coming here to bore you with this, I searched the mailing list and
>> the Bird documentation a bit and found the links below related to it:
>>
>> https://bird.network.cz/ - 7.1 Future work
>> <https://bird.network.cz/?get_doc&v=20&f=bird-7.html>
>> Question about BGP advertisement-interval in BIRD
>> <http://trubka.network.cz/pipermail/bird-users/2020-August/014786.html>
>> [BGP] MRAI connection-based implementation review
>> <http://trubka.network.cz/pipermail/bird-users/2020-April/014433.html>
>> Route Flap Dampening
>> <http://trubka.network.cz/pipermail/bird-users/2019-July/013589.html>
>> [BGP] bird and RFD
>> <http://trubka.network.cz/pipermail/bird-users/2020-May/014612.html>
>>
>> Thank you all!
>>
>> --
>> Douglas Fernando Fischer
>> Control and Automation Engineer
>>
>
>
> --
> Douglas Fernando Fischer
> Control and Automation Engineer
>


-- 
Douglas Fernando Fischer
Control and Automation Engineer