[PATCH] crash in ospf DBDESC

Александр Черников melifaro at yandex-team.ru
Fri Feb 20 13:37:04 CET 2015


I've got the following core for bird 1.4.5:

(gdb) bt
#0  0x000000000043ecfa in ospf_dbdes_send (n=0x8011018a0, next=1) at ../../../proto/ospf/dbdes.c:145
#1  0x000000000043f6b2 in ospf_dbdes_receive (ps_i=0x8010c9000, ifa=0x8011131a0, n=0x8011018a0) at ../../../proto/ospf/dbdes.c:386
#2  0x0000000000438fdd in ospf_rx_hook (sk=0x80101b8c0, size=28) at ../../../proto/ospf/packet.c:485
#3  0x000000000045f972 in sk_read (s=0x80101b8c0) at io.c:1760
#4  0x000000000046034b in io_loop () at io.c:1975
#5  0x0000000000467da3 in main (argc=3, argv=0x7fffffffed30) at main.c:825
(gdb) p n->dbsi
$20 = {prev = 0x0, null = 0x0, next = 0x0, node = 0x0}
(gdb) p sn
$22 = (snode *) 0x0

Investigation has shown that there was major OSPF instability in that area (~20 quagga boxes and a Juniper device) at that moment, with either a re-election or a DR/BDR hang.
Unfortunately, I don't have many logs for that. We also had a (typical) issue with this particular quagga peer just prior to the crash:
Feb 19 18:28:34 XXX ospf6d[8387]: SLOW THREAD: task ospf6_receive (7f793a115810) ran for 5044ms (cpu time 5032ms)

My guess is that
1) we started to send our DB to the peer and it stopped confirming DD packets for a while
2) a flap happened, so part of (or most of) the LSADB got flushed
3) Quagga finally awoke from sleep and confirmed the last packet
4) we tried to get the next chunk of LSAs but there were no more (unsent) LSAs in the DB
5) this message appeared on the list

Something similar to the attached patch should fix this particular issue (at least I hope so).
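
To illustrate the suspected sequence, here is a minimal standalone sketch. It is not BIRD code and it is not the attached patch: the names lsa_entry, neighbor, flush_db and next_chunk are hypothetical stand-ins for the real LSADB iterator state. It only shows the general idea of checking a saved list position before dereferencing it, assuming a flush leaves that position empty (as the zeroed n->dbsi and the NULL sn above suggest).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT BIRD's real structures. */
struct lsa_entry {
    int id;
    struct lsa_entry *next;
};

struct neighbor {
    struct lsa_entry *cursor;  /* next LSA to describe, NULL if none */
};

/* Simulates a flap flushing the database: the saved position is gone. */
void flush_db(struct neighbor *n)
{
    n->cursor = NULL;
}

/*
 * Copy up to `max` LSA ids into `out` and advance the cursor.
 * Returns the number of ids written; 0 means nothing is left to send.
 */
size_t next_chunk(struct neighbor *n, int *out, size_t max)
{
    assert(n != NULL && out != NULL);

    size_t count = 0;

    /* The cursor is checked before every dereference, so an empty
     * (flushed) position yields an empty chunk instead of a crash. */
    while (n->cursor != NULL && count < max) {
        out[count++] = n->cursor->id;
        n->cursor = n->cursor->next;
    }
    return count;
}
```

In this sketch, calling next_chunk() after flush_db() returns 0, so the caller can send an empty description packet (or finish the exchange) rather than walk a NULL node. Whether the real code should treat that case as "no more LSAs" or restart the exchange is a separate question.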
