bird6 1.6.2 hangs doing recvmsg on netlink socket

Israel G. Lugo israel.lugo at lugosys.com
Fri Dec 2 19:46:36 CET 2016


Hello,

I am seeing bird6 hang at random intervals on Debian, running version
1.6.2-1~bpo8+1 from your http://bird.network.cz/debian/ repository.

I've got a single OSPF instance with 74 routes, one eBGP session
receiving a default route, and one iBGP session with another Bird
router, which sends me its own default.

What happens is that, from time to time, bird6 becomes stuck in an
infinite loop doing recvmsg() on a netlink socket, and IPv6 routes are
lost. The interval seems random: it has happened after as little as 3
days and after as long as 2 weeks.


gk1 # strace -p 11465
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
[...]
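
For illustration only, here is a minimal Python sketch of the pattern the
trace is consistent with. It is not BIRD's code, and whether bird6 is
actually retrying like this is not established. A non-blocking socket with
nothing queued returns EAGAIN immediately, so a caller that retries
without waiting for readiness spins. The sketch uses a UNIX datagram
socket pair as a stand-in for the netlink socket, and the helper names
drain_once and wait_and_receive are hypothetical.

```python
import errno
import select
import socket


def drain_once(sock):
    """Try one non-blocking receive; return the bytes, or None on EAGAIN."""
    try:
        data, _ancdata, _flags, _addr = sock.recvmsg(4096)
        return data
    except BlockingIOError as exc:
        # EAGAIN/EWOULDBLOCK: nothing is queued on the socket right now.
        assert exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
        return None


def wait_and_receive(sock, timeout):
    """Wait in select() until the socket is readable, then receive once."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None  # timed out; the caller decides whether to give up
    return drain_once(sock)


# Stand-in for the netlink socket (assumption: any non-blocking datagram
# socket shows the same EAGAIN behaviour when its queue is empty).
reader, writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
reader.setblocking(False)

empty_result = drain_once(reader)            # nothing queued: EAGAIN path
idle_result = wait_and_receive(reader, 0.1)  # still nothing: times out
writer.send(b"reply")
ready_result = wait_and_receive(reader, 1.0)  # data arrives: received
```

Calling drain_once in a tight loop on the empty socket would show up in
strace as the same recvmsg() = -1 EAGAIN line repeating, whereas
wait_and_receive sleeps in select() until there is something to read.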

File descriptor 7 is a netlink socket:

gk1 # lsof -p 11465
COMMAND   PID USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
bird6   11465 bird  cwd       DIR              253,0     4096         2 /
bird6   11465 bird  rtd       DIR              253,0     4096         2 /
bird6   11465 bird  txt       REG              253,0   540648    787381 /usr/sbin/bird6
bird6   11465 bird  mem       REG              253,0    47712    659204 /lib/x86_64-linux-gnu/libnss_files-2.19.so
bird6   11465 bird  mem       REG              253,0    43592    659208 /lib/x86_64-linux-gnu/libnss_nis-2.19.so
bird6   11465 bird  mem       REG              253,0    89104    659199 /lib/x86_64-linux-gnu/libnsl-2.19.so
bird6   11465 bird  mem       REG              253,0    31632    659200 /lib/x86_64-linux-gnu/libnss_compat-2.19.so
bird6   11465 bird  mem       REG              253,0  1738176    659160 /lib/x86_64-linux-gnu/libc-2.19.so
bird6   11465 bird  mem       REG              253,0   137440    655379 /lib/x86_64-linux-gnu/libpthread-2.19.so
bird6   11465 bird  mem       REG              253,0   140928    655799 /lib/x86_64-linux-gnu/ld-2.19.so
bird6   11465 bird    0u      CHR                1,3      0t0      1028 /dev/null
bird6   11465 bird    1u      CHR                1,3      0t0      1028 /dev/null
bird6   11465 bird    2u      CHR                1,3      0t0      1028 /dev/null
bird6   11465 bird    3u     unix 0xffff8803269f7c00      0t0 127941139 socket
bird6   11465 bird    4u     unix 0xffff8803269f7480      0t0 127941145 /run/bird/bird6.ctl
bird6   11465 bird    5u  netlink                         0t0 127906248 ROUTE
bird6   11465 bird    6u  netlink                         0t0 127906249 ROUTE
bird6   11465 bird    7u  netlink                         0t0 127906250 ROUTE
bird6   11465 bird    8u     IPv6          127906251      0t0       TCP *:bgp (LISTEN)
bird6   11465 bird    9u     raw6                         0t0 127906252 00000000000000000000000000000000:0059->00000000000000000000000000000000:0000 st=07
bird6   11465 bird   10u     IPv6          127994711      0t0       TCP e0.gk1:bgp->e0.gk2:39074 (CLOSE_WAIT)
bird6   11465 bird   11u     IPv6          127965176      0t0       TCP [2001:w:y:x::133]:58268->[2001:w:y:x::1]:bgp (CLOSE_WAIT)

Unfortunately I didn't find any debug symbols for this package, so all I
could get from gdb was the following:

(gdb) bt
#0  0x00007f5ad1705e80 in __recvmsg_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f5ad1b90428 in ?? ()
#2  0x00007f5ad1b8956b in ?? ()
#3  0x00007f5ad1b8a06b in ?? ()
#4  0x00007f5ad1b3f0c7 in ?? ()
#5  0x00007f5ad136db45 in __libc_start_main (main=0x7f5ad1b3eb10, argc=5, argv=0x7ffe8cfece28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe8cfece18)
    at libc-start.c:287
#6  0x00007f5ad1b3f3ec in ?? ()
(gdb) info r
rax            0xfffffffffffffff5       -11
rbx            0x7f5ad32aefe0   140028066590688
rcx            0xffffffffffffffff       -1
rdx            0x0      0
rsi            0x7ffe8cfecb70   140731263929200
rdi            0x7      7
rbp            0x7f5ad1dba270   0x7f5ad1dba270
rsp            0x7ffe8cfecb18   0x7ffe8cfecb18
r8             0x7f5ad32aefe0   140028066590688
r9             0x0      0
r10            0x1      1
r11            0x246    582
r12            0x0      0
r13            0x7f5ad32c7f60   140028066692960
r14            0x100    256
r15            0x0      0
rip            0x7f5ad1705e80   0x7f5ad1705e80 <__recvmsg_nocancel+7>
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
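
As a cross-check on the register dump, rax holds the raw syscall return
value, and on Linux x86-64 a failed syscall returns the negated errno.
The short Python sketch below (the helper name to_signed64 is
hypothetical) reinterprets the 64-bit value as signed, giving -11, which
matches the EAGAIN seen in strace, since errno 11 is EAGAIN on Linux.

```python
import errno


def to_signed64(value):
    """Interpret a 64-bit register value as a two's-complement integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value & (1 << 63) else value


rax = 0xFFFFFFFFFFFFFFF5      # rax as shown by "info r"
result = to_signed64(rax)     # raw syscall return value
errno_value = -result         # the kernel returns -errno on failure

# errno.errorcode uses the numbering of the machine running this script;
# on Linux, 11 maps to EAGAIN.
errno_name = errno.errorcode.get(errno_value, "unknown")
print(result, errno_value, errno_name)
```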


Unfortunately, I did not have debug logging enabled when this happened.
I had it on for several days, but either I was "lucky" or the debug
logging somehow prevented the hang. It was producing several MB of debug
logs every day, so I ended up disabling it.

I'm not 100% sure that this was installed from your CZ repository; it
may have been from Debian backports. But I'm 95% sure it came from CZ.
In any case, the MD5 is as follows:

56e48e8e5a1380b384f1758df2077e53  bird_1.6.2-1~bpo8+1_amd64.deb

I have now upgraded to 1.6.2-3~bpo8+1, from your CZ repository.

I can provide the configuration file off-list, if that helps.

Regards,

Israel


