[PATCH] Fix an issue where bird may accidently close a socket

Kenth Eriksson Kenth.Eriksson at infinera.com
Wed Jun 5 15:12:46 CEST 2019


On Wed, 2019-05-29 at 16:58 +0000, Kenth Eriksson wrote:
> On Wed, 2019-05-29 at 16:02 +0200, Ondrej Zajicek wrote:
> > On Mon, May 27, 2019 at 03:29:26PM +0000, Kenth Eriksson wrote:
> > > On Mon, 2019-05-27 at 17:12 +0200, Ondrej Zajicek wrote:
> > > > CAUTION: This email originated from outside of the
> > > > organization.
> > > > Do
> > > > not click links or open attachments unless you recognize the
> > > > sender
> > > > and know the content is safe.
> > > > 
> > > > 
> > > > On Mon, May 27, 2019 at 02:28:52PM +0200, Kenth Eriksson wrote:
> > > > > Datagram sockets may return 0 and stream sockets can return 0
> > > > > if the requested number of bytes to read is 0.
> > > > 
> > > > Hi
> > > > 
> > > > You mean that if count arg to read() is 0?
> > > > 
> > > > How that may happen?
> > > > 
> > > We have a client remote controlling bird using a socket that did
> > > get
> > > get POLLHUP, maybe due to that bird closed the socket. 
> > > 
> > > Don't think checking for 0 is enough, from man page;  
> > 
> > Hi
> > 
> > Did you examine the possible causes of why BIRD closed the socket?
> > That could be done by adding log messages to sk_read(), perhaps
> > limited
> > to cases when s->type == SK_UNIX. The man page describes two
> > additional
> > cases, so it is a good idea to try to distinguish which one
> > happened.
> > 
> We are still trying to reproduce. 
> 
> > Also, i don't think that check for POLLHUP is correct, as that flag
> > generally means socket is closed for write, not that it is closed
> > for
> > read (although some OSes are less consistent in this matter than
> > others).
> > 
> POLLOUT and POLLHUP are mutually exclusive. But POLLIN and POLLHUP
> are
> NOT mutually exclusive. But I can't see that the socket will
> transition
> from POLLHUP|POLLIN to POLLIN only.  
> 
> As I see it there are three ways to tell if a socket is closed;
> 
> 1) write returns EPIPE
> 2) read returns 0 when requesting to read n bytes where n > 0
> 3) POLLHUP from poll (irrespective if read or write)
> 
>  
> > This is particularly fragile part of code, depends on OS API
> > details,
> > so
> > i would avoid to do changes there unless we are 100% sure that they
> > are
> > proper fix for some confirmed condition.
> > 
> Unserstood. If we are able to reproduce, we should be able to confirm
> or update patch.

We have now reproduced this issue. I have added two log points into
bird to better see what's going on. The first log is where first patch
introduced the POLLHUP check. The second log point is from the CLI
error hook where a backtrace is logged. Then the following logs are
collected;

2019-06-05 13:45:49.910 <TRACE> CLI: show ospf interface ospfv3_2 "p1-
1-3-1-4"
2019-06-05 13:45:49.911 <TRACE> CLI: show interfaces
2019-06-05 13:45:49.911 <ERR> Got 0 bytes, errno=11, revents=1,
rbuf=269409024, rpos=269410048
2019-06-05 13:45:49.911 <INFO> /opt/appl/cuappl04a-r33a-
855d38b/sbin/bird() [0x10098a74] 
2019-06-05 13:45:49.911 <INFO> /opt/appl/cuappl04a-r33a-
855d38b/sbin/bird(sk_read+0x1d4) [0x10092228] 
2019-06-05 13:45:49.911 <INFO> /opt/appl/cuappl04a-r33a-
855d38b/sbin/bird(io_loop+0x4d0) [0x10093340] 
2019-06-05 13:45:49.911 <INFO> /opt/appl/cuappl04a-r33a-
855d38b/sbin/bird(main+0x1f8) [0x10099a4c] 
2019-06-05 13:45:49.911 <INFO> /lib/libc.so.6(+0x21208) [0xfe53208] 
2019-06-05 13:45:49.911 <INFO> /lib/libc.so.6(+0x213a8) [0xfe533a8]
2019-06-05 13:45:49.911 <INFO> CLI connection closed

Locking at the backtrace from gdb it looks as follows;

(gdb) list *0x10092228
0x10092228 is in sk_read
(/home/builduser/workspace/Git/PreTestJobs/build_cu4appl/tm3000/ext/bir
d/sysdep/unix/io.c:1879).
1874		}
1875	      }
1876	      else if (!c)
1877	      {
1878		log(L_ERR "Got 0 bytes, errno=%d, revents=%x, rbuf=%d,
rpos=%d", errno, revents, s->rbuf, s->rpos);
1879		s->err_hook(s, 0);
1880	      }
1881	      else
1882	      {
1883		s->rpos += c;
(gdb) list *0x10098a74
0x10098a74 is in cli_err
(/home/builduser/workspace/Git/PreTestJobs/build_cu4appl/tm3000/ext/bir
d/sysdep/unix/main.c:434).
429	#define SIZE 256
430	  void *buffer[SIZE];
431	  char **strings;
432	  int j, nptrs;
433	
434	  nptrs = backtrace(buffer, SIZE);
435	  strings = backtrace_symbols(buffer, nptrs);
436	  if (strings)
437	  {
438	      for (j = 0; strings && j<nptrs; j++)
(gdb) 

The read operation returns POLLIN (revents 1) and 0 bytes. It appears
as the read operation requested 0 bytes (269409024 + 1024 - 269410048 =
0). The command it is processing is very short ("show interfaces").


> > > "When a stream socket peer has performed an orderly shutdown, the
> > > return value will be 0 (the traditional "end-of-file" return).
> > > 
> > > Datagram sockets in various domains (e.g., the UNIX and Internet
> > > domains) permit zero-length datagrams.  When such a datagram is
> > > received, the return value is 0.
> > 
> > Well, we open the socket as SOCK_STREAM and not SOCK_DGRAM, so i
> > would
> > expect this case does not apply. (Although i have no idea what
> > would
> > happen
> > if the other side opens the socket with SOCK_DGRAM.)
> > 
> > And even if this case would happen, we should ignore it and not
> > fall
> > back
> > to call_rx_hook() path.
> > 
> > 
> > > The value 0 may also be returned if the requested number of bytes
> > > to
> > > receive from a stream socket was 0."
> > 
> > This case means that we called read() with (s->rpos == s->rbuf + s-
> > > rbsize),
> > buffer is full and last call_rx_hook() does not eat data from it.
> > That
> > means something is wrong, e.g. an incomplete message filled the
> > entire
> > buffer and without resizing buffer we would just end in endless
> > loop,
> > therefore closing the connection with error is safest thing to do.
> > 



More information about the Bird-users mailing list