Bird stops responding to BFD messages

Pavlos Parissis pavlos.parissis at gmail.com
Tue Nov 1 23:03:15 CET 2016


Hello,

We have Bird 1.4.5 running on ~50 CentOS 7 servers and we have observed that the
Bird daemon stops responding to BFD messages, which causes the BGP peerings to be
torn down and established again.

Some details on our setup:
Servers have 2 interfaces (north and south) and advertise /32 prefixes to the
north and south for IPs assigned to the loopback interface.

Bird logs 'Received: Other configuration change' for the BGP sessions to both
peers, which are 2 different Arista switches, at the same time. Tracing on the
switches shows that Bird didn't respond to 3 BFD messages and the switches informed
Bird about it. It is very unlikely that the switches or cables are the problem here.
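
For reference, the "3 BFD messages" above correspond to the BFD detection time.
Here is a rough sketch of the arithmetic from RFC 5880 (section 6.8.4), using the
400 ms intervals and multiplier 3 from the bfd1 block in the bird.conf further
down, and assuming the Arista side uses the same 400 ms intervals (an assumption,
the switch configuration is not shown here). The function name is only for
illustration:

```python
def bfd_detection_time_ms(remote_detect_mult, local_required_min_rx_ms,
                          remote_desired_min_tx_ms):
    """Detection time in asynchronous mode, as described in RFC 5880, 6.8.4."""
    # The agreed transmit interval of the remote system is the larger of what
    # the local side is willing to receive and what the remote side wants to send.
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


# Seen from the switch: Bird is the remote system.
# Bird advertises multiplier 3 and min tx interval 400 ms (from bird.conf);
# the switch's own required min rx of 400 ms is an ASSUMPTION.
switch_detection_ms = bfd_detection_time_ms(
    remote_detect_mult=3,
    local_required_min_rx_ms=400,
    remote_desired_min_tx_ms=400,
)
print(switch_detection_ms)  # 1200
```

Under these assumptions the switch declares the session down after about 1.2
seconds without a BFD packet from Bird.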

Bird re-establishes the BGP peering within ~50 seconds and the server starts
receiving traffic again. So, we have an outage of about 50 seconds.

We have seen this on different servers connected to different switches.
The frequency of the problem is very low, about once every 2 days across 50 servers.
Furthermore, there isn't any specific pattern in CPU, traffic, interface/TCP errors
or anything else which could help us debug.
In 2-3 cases I have seen the system CPU usage jump to 70% at the timestamp
of the log line 'BGP1: Received: Other configuration change', but the main
software on the servers (HAProxy), which consumes most of the resources, doesn't
show any spike in CPU.

Has anyone seen this behavior?



Nov 01 16:23:00 bird[1376]: BGP1: Received: Other configuration change
Nov 01 16:23:00 bird[1376]: BGP1: BGP session closed
Nov 01 16:23:00 bird[1376]: BGP1: State changed to stop
Nov 01 16:23:00 bird[1376]: BGP1: Down
Nov 01 16:23:00 bird[1376]: bfd1: Session to 2.2.2.2 removed
Nov 01 16:23:00 bird[1376]: BGP1: State changed to down
Nov 01 16:23:00 bird[1376]: BGP1: Starting
Nov 01 16:23:00 bird[1376]: BGP1: State changed to start
Nov 01 16:23:00 bird[1376]: bfd1: Session to 2.2.2.2 added
Nov 01 16:23:00 bird[1376]: BGP1: Startup delayed by 60 seconds
Nov 01 16:23:00 bird[1376]: BGP2: Received: Other configuration change
Nov 01 16:23:00 bird[1376]: BGP2: BGP session closed
Nov 01 16:23:00 bird[1376]: BGP2: State changed to stop
Nov 01 16:23:00 bird[1376]: BGP2: Down
Nov 01 16:23:00 bird[1376]: bfd1: Session to 1.1.1.1 removed
Nov 01 16:23:00 bird[1376]: BGP2: State changed to down
Nov 01 16:23:00 bird[1376]: BGP2: Starting
Nov 01 16:23:00 bird[1376]: BGP2: State changed to start
Nov 01 16:23:00 bird[1376]: bfd1: Session to 1.1.1.1 added
Nov 01 16:23:00 bird[1376]: BGP2: Startup delayed by 60 seconds
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877)
Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)
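
In case it helps anyone comparing events across servers, here is a minimal sketch
for summarizing log excerpts like the one above. It assumes the journal line
format shown here (not an official BIRD log schema) and expects each event,
including the session id, on a single line. The function and pattern names are
made up for illustration:

```python
import re
from collections import Counter

# "<timestamp> bird[<pid>]: <protocol>: <message>", as in the excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) bird\[\d+\]: "
    r"(?P<proto>\w+): (?P<msg>.*)$"
)
BAD_PACKET_RE = re.compile(
    r"^Bad packet from (?P<peer>\S+) - unknown session id \((?P<sid>\d+)\)$"
)


def summarize(lines):
    """Count protocol events and 'Bad packet' messages per (peer, session id)."""
    events = Counter()
    bad_packets = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a bird log line
        bad = BAD_PACKET_RE.match(m["msg"])
        if bad:
            bad_packets[(bad["peer"], bad["sid"])] += 1
        else:
            events[(m["proto"], m["msg"])] += 1
    return events, bad_packets


sample = [
    "Nov 01 16:23:00 bird[1376]: BGP1: Received: Other configuration change",
    "Nov 01 16:23:00 bird[1376]: bfd1: Session to 2.2.2.2 removed",
    "Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)",
    "Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 2.2.2.2 - unknown session id (3632088877)",
    "Nov 01 16:23:00 bird[1376]: bfd1: Bad packet from 1.1.1.1 - unknown session id (3840383752)",
]
events, bad_packets = summarize(sample)
for (peer, sid), count in sorted(bad_packets.items()):
    print(peer, sid, count)
```

Feeding it the journal output of each affected server would show whether the
same peers and session ids keep showing up around every flap.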

bird.conf:
log syslog { debug, trace, info, remote, warning, error, auth, fatal, bug };
include "/etc/bird.d/*.conf";

router id 1.1.1.2;

protocol device {
    scan time 10;
}

protocol static {
    disabled yes;
}

protocol direct direct1 {
    interface "lo";
    debug all;
    export none;
    import where net ~ ANYCAST_NETWORKS;
}

protocol bfd bfd1 {
    debug { states, routes, filters, interfaces, events };
    interface "north", "south" {
        min rx interval 400 ms;
        min tx interval 400 ms;
        idle tx interval 1000 ms;
        multiplier 3;
    };
}

protocol bgp BGP1 {
    disabled no;
    description "Peer-BGP1";
    neighbor 1.1.1.1 as 64824;
    source address 1.1.1.2;
    bfd on;
    debug all;
    import none;
    export where match_route_north();
    direct;
    hold time 10;
    startup hold time 240;
    connect retry time 120;
    keepalive time 3;
    start delay time 5;
    error wait time 60, 300;
    error forget time 300;
    disable after error off;
    next hop self;
    path metric 1;
    default bgp_med 0;
    default bgp_local_pref 0;
    local as 64825;
}

protocol bgp BGP2 {
    disabled no;
    description "Peer-BGP2";
    neighbor 2.2.2.2 as 64827;
    source address 2.2.2.1;
    bfd on;
    debug all;
    import none;
    export where match_route_south();
    direct;
    hold time 10;
    startup hold time 240;
    connect retry time 120;
    keepalive time 3;
    start delay time 5;
    error wait time 60, 300;
    error forget time 300;
    disable after error off;
    next hop self;
    path metric 1;
    default bgp_med 0;
    default bgp_local_pref 0;
    local as 64825;
}

Cheers,
Pavlos
