bird incorrectly merging next hops

Bernardo Figueiredo bernardopascoalfigueiredo at tecnico.ulisboa.pt
Thu Dec 15 13:42:37 CET 2016


Hi,

At our network we use bird in all our Linux routers so they talk
OSPF.  Recently
we added a new router appliance(not Linux) to the backbone and most of
the bird
routers showed wrong next hops for some routes.  At first we tought that
the
problem might be caused by a bug in the appliance making it send bad
LSAs, but
debugging the ospf packets and looking at other non-bird router
appliances made
us certain that the fault was in bird.

After looking through the code we found that the bug is in function
'ospf_ext_spf' in file 'proto/ospf/rt.c'.  The variable 'nfa' of type
'struct
orta' is not totally cleaned before each start of the 'WALK_SLIST' loop,
and can
have a wrong value of 'nfa.nhs_reuse = 1' instead of 'nfa.nhs_reuse =
0', which
is what triggers the bug.

The fix is to put the declaration of 'orta nfa = {};' inside the loop,
so the
variable is clean at the start of each loop. Alternatively at the start
of each
loop you could set 'nfa.nhs_reuse = 0'.

The bug only occurs when the following options are in bird.conf:
ecmp yes;
merge external yes;

At least from what I saw the bug affects versions '2.0', '1.6.2' and
'1.5'.

If some of the explanation is not clear please tell me and I'll try to
write it
better.

I'll provide a more in-depth explanation now:

the relevant code is:
the code is 'tagged' with 'x->' which are the relevant points for
trigerring the
bug

static void
ospf_ext_spf(struct ospf_proto *p) {
  ort *nf1, *nf2;
0->orta nfa = {};
  ip_addr rtid;

  OSPF_TRACE(D_EVENTS, "Starting routing table calculation for ext
routes");

  WALK_SLIST(en, p->lsal) { 
    /*not showing LSA validation*/

    rtid = ipa_from_rid(en->lsa.rt);
1-> nf1 = fib_find(&atmp->rtr, &rtid, MAX_PREFIX_LENGTH);

    /*not showing other validations*/

    if (!rt.fbit) {
       nf2 = nf1;
2->    nfa.nhs = nf1->n.nhs;
       br_metric = nf1->n.metric1;
    } else {
       nf2 = ospf_fib_route(&p->rtf, rt.fwaddr, MAX_PREFIX_LENGTH);
       if (!nf2) { continue; }
       /*not showing some validations*/
       nfa.nhs = nf2->n.nhs;
       br_metric = nf2->n.metric1;

       if (has_device_nexthops(nfa.nhs)) {
         nfa.nhs = fix_device_nexthops(p, nfa.nhs, rt.fwaddr);
3->       nfa.nhs_reuse = 1;
       }
    }

    /*not showing setting other attributes in 'nfa'*/

4-> ri_install_ext(p, rt.ip, rt.pxlen, &nfa);
  }
}

static inline void
ri_install_ext(struct ospf_proto *p, ip_addr prefix, int pxlen, const
orta *new)
{
5->ort *old = (ort *) fib_get(&p->rtf, &prefix, pxlen);

  int cmp = orta_compare_ext(p, new, &old->n);

  if (cmp > 0) {
    ort_replace(old, new);
  } else if (cmp == 0) {
6-> ort_merge_ext(p, old, new);
  }
}

static void
ort_merge_ext(struct ospf_proto *p, ort *o, const orta *new) {
  orta *old = &o->n;

  if (old->nhs != new->nhs) {
7->    old->nhs = mpnh_merge(old->nhs, new->nhs, old->nhs_reuse, new-
>nhs_reuse,
p->ecmp, p->nhpool);
    old->nhs_reuse = 1;
  }

/* not showing code that sets other attributes */
}

struct mpnh *
mpnh_merge(struct mpnh *x, struct mpnh *y, int rx, int ry, int max,
linpool *lp)
{
  struct mpnh *root = NULL;
  struct mpnh **n = &root;

  while ((x || y) && max--) {
    int cmp = mpnh_compare_node(x, y);
    if (cmp < 0) {
      *n = rx ? x : mpnh_copy_node(x, lp);
      x = x->next;
    } else if (cmp > 0) {
8->   *n = ry ? y : mpnh_copy_node(y, lp);
      y = y->next;
    } else {
9->   *n = rx ? x : (ry ? y : mpnh_copy_node(x, lp));
      x = x->next;
      y = y->next;
    }
    n = &((*n)->next);
  }
  *n = NULL;

  return root;
}

conditions to trigger the bug:
* while in the loop there must be a LSA that has a forwading address
which
  routes satisfy the conditions to trigger 'has_device_nexthops' so
  'nfa.nhs_reuse' is set to 1, this corresponds to '3->'.
Afterwards the bug could be trigger in any LSA that has the following
properties:
* LSA forwarding address equal to 0.0.0.0; so we set 'nfa' to the route
to the
  router that sent the LSA. This last route is the one might be
corrupted. This
corresponds to '1->' and '2->';
* the route passes the validations to be installed in the OSPF routing
table.
  This corresponds to '4->'
* the old route that we had for the destination on the LSA is equally
good to
  the new route that we are installing so we decide to merge them. This
  corresponds to '5->' and '6->'
* The routes aren't the same instance so we decide to merge them. '7->'
* While merging the next hop list we must compare at least one next hop
from the
  'new' route  higher than a next hop from the 'old' route so we enter
'8->' and
  wrongly not copy the next hop because 'ry = 1' due to 'nfa.nhs_reuse =
1'.

This might also happen in '9->' if the next hops compare equal, but I
haven't
checked.


More information about the Bird-users mailing list