Bird dying on nl_get_reply

Ondrej Zajicek santiago at crfreenet.org
Tue May 2 16:47:36 CEST 2017


On Tue, May 02, 2017 at 03:16:08PM +0200, Vincent Bernat wrote:
> Hey!
> 
> Just got an instance of BIRD dying unexpectedly after displaying the
> following message:
> 
> nl_get_reply: No buffer space available
> 
> It's from netlink.c:
> 
> 	  int x = recvmsg(nl->fd, &m, 0);
> 	  if (x < 0)
> 	    die("nl_get_reply: %m");
> 
> Manpage for netlink(7) says an application should expect such a
> condition:
> 
>        However, reliable transmissions from kernel to user are
>        impossible in any case.  The kernel can't send a netlink message
>        if the socket buffer is full: the message will be dropped and the
>        kernel and the user-space process will no longer have the same
>        view of kernel state.  It is up to the application to detect when
>        this happens (via the ENOBUFS error returned by recvmsg(2)) and
>        resynchronize.
> 
> Another possibility would be to use NETLINK_NO_ENOBUFS socket option:
> 
>        This flag can be used by unicast and broadcast listeners to avoid
>        receiving ENOBUFS errors.
> 
> I don't think using this flag is a good idea.
> 
> I thought this problem has already been reported recently, but I didn't
> find the thread back. The receive buffer could be increased dynamically
> when this happens.

Hi

There was this commit [*] that fixed a hang on a netlink socket, but that
was related to the async netlink socket, while this problem is related to
a different one (BIRD uses three netlink sockets: one for
requests/route changes, one for synchronous scans and one for asynchronous
notifications).

[*] https://gitlab.labs.nic.cz/labs/bird/commit/2c33da507046c25d87741fe0ce7947985c8c7a10

A buffer-size problem is strange in this case, because BIRD uses
nl_get_reply() in a request/reply manner, waiting for the answer to the last
request, so the buffer should be empty when a request is sent.
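
One way to test that assumption while debugging would be to check, just
before sending a request, whether something is already queued on the socket.
A minimal sketch, assuming a plain poll() with zero timeout is acceptable at
that point; nl_has_pending() is a hypothetical helper, not an existing BIRD
function:

```c
#include <poll.h>

/* Hypothetical debugging helper (not part of BIRD): returns 1 if data is
 * already queued on fd, 0 if nothing is readable, -1 if poll() itself fails. */
static int nl_has_pending(int fd)
{
  struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

  int rv = poll(&pfd, 1, 0);   /* zero timeout: never blocks */
  if (rv < 0)
    return -1;

  return (rv > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

If this ever returned 1 before a request is sent, it would mean that
unexpected messages are arriving on the request socket.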

> Or maybe we could just ignore the error and wait for
> the next kernel sync to catch up. Or the 8192 value could be configured
> at build-time. What's the best option?

Well, you could try increasing NL_RX_SIZE to, say, 64k. But the best solution
would be to have proper error handling in nl_get_reply(). The main
question is why the buffer was full, as it is not the buffer that gets
async notifications; it just gets responses.
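
As a rough illustration of what such error handling could look like (a
sketch only, not the actual BIRD code): instead of calling die() on any
recvmsg() failure, the error could be classified so that ENOBUFS leads to a
resync, as netlink(7) suggests. The names nl_rx_action, nl_rx_classify(),
nl_recv_checked() and sk_grow_rcvbuf() are made up for this example, and
treating EINTR as a retry is my assumption. The last helper enlarges the
kernel socket buffer (SO_RCVBUF), which is a different thing from NL_RX_SIZE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical outcome of one receive attempt. */
enum nl_rx_action {
  NL_RX_OK,       /* message received */
  NL_RX_RETRY,    /* interrupted by a signal, try again */
  NL_RX_RESYNC,   /* ENOBUFS: messages were dropped, rescan kernel state */
  NL_RX_FATAL     /* anything else */
};

/* Map the recvmsg() return value and errno to an action. */
static enum nl_rx_action nl_rx_classify(ssize_t x, int err)
{
  if (x >= 0)
    return NL_RX_OK;

  switch (err)
  {
    case EINTR:   return NL_RX_RETRY;
    case ENOBUFS: return NL_RX_RESYNC;
    default:      return NL_RX_FATAL;
  }
}

/* Receive one message, retrying on EINTR; the caller decides what to do
 * on NL_RX_RESYNC (e.g. schedule a full scan) or NL_RX_FATAL. */
static ssize_t nl_recv_checked(int fd, struct msghdr *m, enum nl_rx_action *act)
{
  assert(m != NULL && act != NULL);

  for (;;)
  {
    ssize_t x = recvmsg(fd, m, 0);
    *act = nl_rx_classify(x, errno);
    if (*act != NL_RX_RETRY)
      return x;
  }
}

/* Ask the kernel for a receive buffer of at least 'want' bytes.
 * Returns the size reported by the kernel afterwards, or -1 on error.
 * The kernel may cap the request, so the result should be checked. */
static int sk_grow_rcvbuf(int fd, int want)
{
  int cur = 0;
  socklen_t len = sizeof(cur);

  if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &cur, &len) < 0)
    return -1;
  if (cur >= want)
    return cur;
  if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) < 0)
    return -1;

  len = sizeof(cur);
  if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &cur, &len) < 0)
    return -1;
  return cur;
}
```

With something like this, nl_get_reply() would die only on NL_RX_FATAL; on
NL_RX_RESYNC it could log the event, optionally grow the buffer, and wait for
the next kernel sync to catch up.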

-- 
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santiago at crfreenet.org)
OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
"To err is human -- to blame it on a computer is even more so."

