bird6 1.6.2 hangs doing recvmsg on netlink socket

Israel G. Lugo israel.lugo at lugosys.com
Fri Dec 9 13:26:33 CET 2016


Just had another hang, 7 days after my previous email. Exact same
symptoms, this time with the latest version from the CZ repository:
1.6.2-3~bpo8+1.

bird6 is stuck in recvmsg() at 100% CPU, getting EAGAIN in an infinite loop:

# strace -p 23020
recvmsg(7, 0x7ffc45ae0ab0, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0)           = -1 EAGAIN (Resource temporarily unavailable)
[...]

None of this happened in 1.5.0.
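
For reference, the strace pattern above is what a read loop looks like when it calls recvmsg() again immediately after EAGAIN instead of going back to wait for the socket to become readable. Below is a minimal sketch of the usual handling, written in Python rather than BIRD's C, with a local datagram socket pair standing in for the netlink socket. The names drain and wait_and_drain are hypothetical and exist only for this illustration; this is not BIRD's code and makes no claim about where the actual bug is.

```python
import select
import socket


def drain(sock, bufsize=4096):
    """Read every datagram currently queued on a non-blocking socket.

    EAGAIN/EWOULDBLOCK means "queue empty, come back later", so the loop
    returns instead of calling recv() again straight away.
    """
    messages = []
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            # Python raises BlockingIOError for EAGAIN / EWOULDBLOCK.
            return messages
        messages.append(data)


def wait_and_drain(sock, timeout=1.0):
    """Sleep in select() until the socket is readable, then drain it."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return drain(sock) if readable else []


if __name__ == "__main__":
    # A local datagram socket pair stands in for the netlink socket.
    rx, tx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    rx.setblocking(False)
    tx.send(b"one")
    tx.send(b"two")
    print(wait_and_drain(rx))        # both queued datagrams
    print(wait_and_drain(rx, 0.1))   # nothing queued: empty list after the timeout
```

The point of the sketch is only that EAGAIN ends the read pass; the process then sleeps in select() rather than spinning.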

What can I do to help troubleshoot this? This is a major regression, and
it has me seriously concerned about both of my edge routers, which run the
same version of BIRD.



On 12/02/2016 06:46 PM, Israel G. Lugo wrote:
> Hello,
>
> I am getting some random crashes in bird6, running on Debian, version
> 1.6.2-1~bpo8+1 from your http://bird.network.cz/debian/ repository.
>
> I've got a single OSPF instance with 74 routes, one eBGP session
> receiving a default route, and one iBGP session with another Bird
> router, which sends me its own default.
>
> What happens is that, from time to time, bird6 becomes stuck in an
> infinite loop doing recvmsg() on a netlink socket, and IPv6 routes are
> lost. The interval seems random: once it was 3 days, another time it was
> 2 weeks.
>
>
> gk1 # strace -p 11465
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0)           = -1 EAGAIN (Resource temporarily unavailable)
> [...]
>
> File descriptor 7 is a netlink socket:
>
> gk1 # lsof -p 11465
> COMMAND   PID USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
> bird6   11465 bird  cwd       DIR              253,0     4096         2 /
> bird6   11465 bird  rtd       DIR              253,0     4096         2 /
> bird6   11465 bird  txt       REG              253,0   540648    787381 /usr/sbin/bird6
> bird6   11465 bird  mem       REG              253,0    47712    659204 /lib/x86_64-linux-gnu/libnss_files-2.19.so
> bird6   11465 bird  mem       REG              253,0    43592    659208 /lib/x86_64-linux-gnu/libnss_nis-2.19.so
> bird6   11465 bird  mem       REG              253,0    89104    659199 /lib/x86_64-linux-gnu/libnsl-2.19.so
> bird6   11465 bird  mem       REG              253,0    31632    659200 /lib/x86_64-linux-gnu/libnss_compat-2.19.so
> bird6   11465 bird  mem       REG              253,0  1738176    659160 /lib/x86_64-linux-gnu/libc-2.19.so
> bird6   11465 bird  mem       REG              253,0   137440    655379 /lib/x86_64-linux-gnu/libpthread-2.19.so
> bird6   11465 bird  mem       REG              253,0   140928    655799 /lib/x86_64-linux-gnu/ld-2.19.so
> bird6   11465 bird    0u      CHR                1,3      0t0      1028 /dev/null
> bird6   11465 bird    1u      CHR                1,3      0t0      1028 /dev/null
> bird6   11465 bird    2u      CHR                1,3      0t0      1028 /dev/null
> bird6   11465 bird    3u     unix 0xffff8803269f7c00      0t0 127941139 socket
> bird6   11465 bird    4u     unix 0xffff8803269f7480      0t0 127941145 /run/bird/bird6.ctl
> bird6   11465 bird    5u  netlink                         0t0 127906248 ROUTE
> bird6   11465 bird    6u  netlink                         0t0 127906249 ROUTE
> bird6   11465 bird    7u  netlink                         0t0 127906250 ROUTE
> bird6   11465 bird    8u     IPv6          127906251      0t0       TCP *:bgp (LISTEN)
> bird6   11465 bird    9u     raw6                         0t0 127906252 00000000000000000000000000000000:0059->00000000000000000000000000000000:0000 st=07
> bird6   11465 bird   10u     IPv6          127994711      0t0       TCP e0.gk1:bgp->e0.gk2:39074 (CLOSE_WAIT)
> bird6   11465 bird   11u     IPv6          127965176      0t0       TCP [2001:w:y:x::133]:58268->[2001:w:y:x::1]:bgp (CLOSE_WAIT)
>
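
An aside on the lsof listing: the NODE value for fd 7 (127906250) is the socket inode, and on Linux the same inode should show up in /proc/net/netlink together with per-socket counters. The following is a rough Python sketch of that lookup, not anything taken from BIRD. The function names are hypothetical, and the column names used here (Inode, and Rmem and Drops in the usage note) are assumptions about the kernel's header line, so check the header on the running kernel before relying on them.

```python
import os
import re


def socket_inode(link_target):
    """Extract the inode from a /proc/<pid>/fd link such as 'socket:[127906250]'."""
    match = re.fullmatch(r"socket:\[(\d+)\]", link_target)
    return int(match.group(1)) if match else None


def parse_netlink_table(text):
    """Parse /proc/net/netlink text into dicts keyed by its header names."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def netlink_entry(pid, fd):
    """Return the /proc/net/netlink row for a process fd, or None if not found."""
    inode = socket_inode(os.readlink(f"/proc/{pid}/fd/{fd}"))
    if inode is None:
        return None  # the fd is not a socket
    with open("/proc/net/netlink") as table:
        rows = parse_netlink_table(table.read())
    for row in rows:
        # "Inode" is the assumed header name for the socket inode column.
        if row.get("Inode") == str(inode):
            return row
    return None
```

Run as root while the process is stuck, netlink_entry(11465, 7) would return the row for that socket; if the assumed Rmem and Drops columns are present, they would show whether data is queued on the socket or messages were dropped.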
> Unfortunately I didn't find any debug symbols for this package, so all I
> could get from gdb was the following:
>
> (gdb) bt
> #0  0x00007f5ad1705e80 in __recvmsg_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007f5ad1b90428 in ?? ()
> #2  0x00007f5ad1b8956b in ?? ()
> #3  0x00007f5ad1b8a06b in ?? ()
> #4  0x00007f5ad1b3f0c7 in ?? ()
> #5  0x00007f5ad136db45 in __libc_start_main (main=0x7f5ad1b3eb10, argc=5, argv=0x7ffe8cfece28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe8cfece18) at libc-start.c:287
> #6  0x00007f5ad1b3f3ec in ?? ()
> (gdb) info r
> rax            0xfffffffffffffff5       -11
> rbx            0x7f5ad32aefe0   140028066590688
> rcx            0xffffffffffffffff       -1
> rdx            0x0      0
> rsi            0x7ffe8cfecb70   140731263929200
> rdi            0x7      7
> rbp            0x7f5ad1dba270   0x7f5ad1dba270
> rsp            0x7ffe8cfecb18   0x7ffe8cfecb18
> r8             0x7f5ad32aefe0   140028066590688
> r9             0x0      0
> r10            0x1      1
> r11            0x246    582
> r12            0x0      0
> r13            0x7f5ad32c7f60   140028066692960
> r14            0x100    256
> r15            0x0      0
> rip            0x7f5ad1705e80   0x7f5ad1705e80 <__recvmsg_nocancel+7>
> eflags         0x246    [ PF ZF IF ]
> cs             0x33     51
> ss             0x2b     43
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
>
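
A note on reading the register dump: rax holds the raw syscall return value, and Linux reports a failed syscall as a negated errno. The value 0xfffffffffffffff5 is -11 as a signed 64-bit integer, which matches the EAGAIN that strace shows. A small Python sketch of that decoding follows; the helper names are hypothetical, and the errno name lookup depends on the platform's errno table.

```python
import errno


def syscall_result(rax):
    """Interpret a raw 64-bit register value as a signed syscall return."""
    return rax - (1 << 64) if rax >= (1 << 63) else rax


def describe(rax):
    """Return (signed value, errno name or None) for a raw rax value."""
    value = syscall_result(rax)
    # Linux syscalls report failure as -errno, in the range -4095..-1.
    name = errno.errorcode.get(-value) if -4095 <= value < 0 else None
    return value, name


if __name__ == "__main__":
    value, name = describe(0xfffffffffffffff5)
    print(value, name)  # value is -11; the name comes from the local errno table
```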
>
> Unfortunately, I did not have debug logging on when this happened. I had
> it on for several days, but either I was "lucky" or the debugging somehow
> prevented the hang. It was producing several MB of debug logs every day,
> so I ended up disabling it.
>
> I'm not 100% sure that this was installed from your CZ repository; it
> may have been from Debian backports. But I'm 95% sure it came from CZ.
> In any case, the MD5 is as follows:
>
> 56e48e8e5a1380b384f1758df2077e53  bird_1.6.2-1~bpo8+1_amd64.deb
>
> I have now upgraded to 1.6.2-3~bpo8+1, from your CZ repository.
>
> I can provide the configuration file off-list, if that helps.
>
> Regards,
>
> Israel
>
