BIRD crashing when --enable-debug is set

andreas at rammhold.de andreas at rammhold.de
Sun Nov 22 16:47:15 CET 2020


Hi,

I've been playing with the BIRD source code and implementing a new
extension to Babel but that is not what this mail is about. While doing
that I noticed passing --enable-debug to configure causes BIRD to crash
regardless of configuration:

> $ bt
> #0  0x00007ffff7d9741a in raise () from /nix/store/2wrfwfdpklhaqhjxgq6yd257cagdxgph-glibc-2.32/lib/libc.so.6
> #1  0x00007ffff7d81523 in abort () from /nix/store/2wrfwfdpklhaqhjxgq6yd257cagdxgph-glibc-2.32/lib/libc.so.6
> #2  0x00000000004912c7 in bug (msg=msg at entry=0x4a06d8 "Assertion '%s' failed at %s:%d") at sysdep/unix/log.c:281
> #3  0x000000000049175f in add_tail (n=0x4e4880 <lc_syslog>, l=0x4e8060 <log_list>) at ./lib/lists.c:79
> #4  default_log_list (syslog_name=<synthetic pointer>, initial=0x0) at sysdep/unix/log.c:345
> #5  log_switch (initial=initial at entry=0x0, logs=logs at entry=0x4eecb0, new_syslog_name=0x0) at sysdep/unix/log.c:373
> #6  0x000000000049216b in sysdep_commit (new=new at entry=0x4eec70, old=<optimized out>) at sysdep/unix/main.c:188
> #7  0x0000000000420a1a in config_do_commit (c=c at entry=0x4eec70, type=type at entry=0x1) at conf/conf.c:265
> #8  0x0000000000420c29 in config_commit (c=c at entry=0x4eec70, type=type at entry=0x1, timeout=timeout at entry=0x0) at conf/conf.c:361
> #9  0x0000000000492f3a in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:912
> #10 0x00007ffff7d82dbd in __libc_start_main () from /nix/store/2wrfwfdpklhaqhjxgq6yd257cagdxgph-glibc-2.32/lib/libc.so.6
> #11 0x000000000041087a in _start () at ../sysdeps/x86_64/start.S:120

The statement triggering this is the check for n->prev == NULL in
lib/lists.c:79:
   75 │ LIST_INLINE void
   76 │ add_tail(list *l, node *n)
   77 │ {
   78 │   EXPENSIVE_CHECK(check_list(l, NULL));
-> 79 │   ASSUME(n->prev == NULL);
   80 │   ASSUME(n->next == NULL);
   81 │
   82 │   node *z = l->tail;
   83 │
   84 │   n->next = &l->tail_node;
   85 │   n->prev = z;
   86 │   z->next = n;
   87 │   l->tail = n;
   88 │ }

Note: Expensive checks are disabled.

To reproduce build BIRD on the current master (3347aaf as of this
writing) pass --enable-debug to the configure script.

Create a simple configuration file that contains at least one protocol.
I used just "protocol device {}" as the whole of the configuration file.

Then execute "./bird -c minimal.conf -s test.ctl -f" and observe it
crash.

Bisecting the issue from v2.0.7 to the latest master branch showed that
this seems to have been broken / uncovered in baac700:

> commit baac7009063d94799821422ecc63ea2af41561ea (refs/bisect/bad)
> Author: Maria Matejka <mq at ucw.cz>
> Date:   Wed Aug 14 16:23:58 2019 +0200
> 
>     List expensive check.
(https://gitlab.nic.cz/labs/bird/-/commit/baac7009063d94799821422ecc63ea2af41561ea)

Looking at the backtrace we can see that `default_log_list` is being
called just before the check is being made:

sysdep/unix/log.c
   341 │ #ifdef HAVE_SYSLOG_H
   342 │   if (!dbgf)
   343 │     {
   344 │       static struct log_config lc_syslog = { .mask = ~0 };
-> 345 │       add_tail(&log_list, &lc_syslog.n);
   346 │       *syslog_name = bird_name;
   347 │     }
   348 │ #endif

This code path is being called twice during the startup. One of the call
chains is the one visible in the backtrace (which happens after/during
configuration commit) and the other happens very early in the process
startup:

sysdep/unix/main.c
   843 │ int
   844 │ main(int argc, char **argv)
   845 │ {
 …
   851 │   parse_args(argc, argv);
-> 852 │   log_switch(1, NULL, NULL);
 …
   912 │   config_commit(conf, RECONFIG_HARD, 0); // 2nd call, see bt

Since the log_config struct is static (sysdep/unix/log.c:344) the same
entry is being added twice to the log_list and thus at the 2nd attempt
of adding it the prev field is no longer the expected NULL value and
thus triggering the shown abort.

The log_list itself is also static (in default_log_list) but initialized
on every execution of the function.  Probably purposely. This still
leaves for a brief moment where structures are inconsistent. I am not
yet familiar enough with the BIRD code base to reason about where
threading is use and if this is an actual issue or not.

Now the question is where the fix should happen? We could add code to
the log_close logic that resets the node pointers of each of the
entries? Another option could be in the default_log_list function by
just reassigning the value of each fo the log_config structs on every
single call to the function. As the log_list is also initialized on
every call this is perhaps the simplest but also not cleanest way to
deal with this?


The initially described issue also occurs when not setting
--enable-debug it just goes silently unnoticed.

I went ahead and applied the below patch to make all ASSUME invocations
log (in a normal build) to see how many more of those messages are going
unnoticed.

------< snip >------
diff --git a/lib/birdlib.h b/lib/birdlib.h
index 23036c1b..76848e2c 100644
--- a/lib/birdlib.h
+++ b/lib/birdlib.h
@@ -176,7 +176,7 @@ void debug(const char *msg, ...);   /* Printf to debug output */
 #endif
 #else
 #define ASSERT(x) do { if (!(x)) log(L_BUG "Assertion '%s' failed at %s:%d", #x, __FILE__, __LINE__); } while(0)
-#define ASSUME(x) /* intentionally left blank */
+#define ASSUME(x) do { if (!(x)) log(L_BUG "Assertion '%s' failed at %s:%d\n", #x, __FILE__, __LINE__); } while(0)
 #endif
------< snip >------

On one of my machines with a few BGP sessions and Babel interfaces I
instructed GDB to print a backtrace whenever the n->prev == NULL
condition fails.

Here the those few that I saw and could distinguish after a very brief look.

> Breakpoint 1, add_tail (n=0x110f298, l=0x7fff7cb58b60) at ./lib/lists.c:80
> #0  add_tail (n=0x110f298, l=0x7fff7cb58b60) at ./lib/lists.c:80
> #1  babel_process_packet (pkt=<optimized out>, len=<optimized out>, saddr=..., ifa=<optimized out>) at proto/babel/packets.c:1393
> #2  0x000000000043e2aa in babel_rx_hook (sk=<optimized out>, len=<optimized out>) at proto/babel/packets.c:1472
> #3  0x000000000040a9d3 in sk_read (s=s at entry=0x110dc80, revents=<optimized out>) at sysdep/unix/io.c:1910
> #4  0x000000000040b394 in io_loop () at sysdep/unix/io.c:2342
> #5  0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923

> Breakpoint 1, add_tail (n=0x110f6d0, l=0x1108848) at ./lib/lists.c:80
> #0  add_tail (n=0x110f6d0, l=0x1108848) at ./lib/lists.c:80
> #1  babel_enqueue (msg=<optimized out>, ifa=0x11087b0) at proto/babel/packets.c:1309
> #2  0x0000000000445143 in babel_send_seqno_request (p=p at entry=0x109bc20, e=e at entry=0x11050c8, sr=sr at entry=0x1271718) at proto/babel/babel.c:911
> #3  0x00000000004459c5 in babel_add_seqno_request (p=p at entry=0x109bc20, e=e at entry=0x11050c8, router_id=<optimized out>, seqno=<optimized out>, hop_count=hop_count at entry=0 '\000', 
>     nbr=nbr at entry=0x0) at proto/babel/babel.c:337
> #4  0x0000000000445cf3 in babel_select_route (p=p at entry=0x109bc20, e=0x11050c8, mod=<optimized out>) at proto/babel/babel.c:775
> #5  0x0000000000445f13 in babel_update_cost (nbr=nbr at entry=0x1111710) at proto/babel/babel.c:611
> #6  0x000000000044626b in babel_expire_hello (p=p at entry=0x109bc20, nbr=nbr at entry=0x1111710, now_=now_ at entry=242732001004) at proto/babel/babel.c:486
> #7  0x00000000004462ed in babel_expire_neighbors (p=p at entry=0x109bc20) at proto/babel/babel.c:506
> #8  0x00000000004473ae in babel_timer (t=<optimized out>) at proto/babel/babel.c:2072
> #9  0x000000000045da19 in timers_fire (loop=0x4db740 <main_timeloop>) at lib/timer.c:235
> #10 0x000000000040afe9 in io_loop () at sysdep/unix/io.c:2192
> #11 0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923

> Breakpoint 1, add_tail (n=0x22c61a8, l=0x7ffe4c97c250) at ./lib/lists.c:80
> #0  add_tail (n=0x22c61a8, l=0x7ffe4c97c250) at ./lib/lists.c:80
> #1  babel_send_unicast (msg=0x7ffe4c97c2a0, ifa=0x22bfa20, dest=...) at proto/babel/packets.c:1283
> #2  0x00000000004451d9 in babel_send_route_request (p=<optimized out>, e=<optimized out>, n=0x22c8800) at proto/babel/babel.c:869
> #3  0x0000000000445233 in babel_refresh_route (p=p at entry=0x2252c20, r=r at entry=0x2423108) at proto/babel/babel.c:220
> #4  0x00000000004471c3 in babel_expire_routes_ (p=p at entry=0x2252c20, rtable=rtable at entry=0x2252d58) at proto/babel/babel.c:243
> #5  0x0000000000447388 in babel_expire_routes (p=p at entry=0x2252c20) at proto/babel/babel.c:293
> #6  0x00000000004473a6 in babel_timer (t=<optimized out>) at proto/babel/babel.c:2071
> #7  0x000000000045da19 in timers_fire (loop=0x4db740 <main_timeloop>) at lib/timer.c:235
> #8  0x000000000040afe9 in io_loop () at sysdep/unix/io.c:2192
> #9  0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923

> Breakpoint 1, add_tail (n=0x234ac30, l=0x241f620) at ./lib/lists.c:82
> #0  add_tail (n=0x234ac30, l=0x241f620) at ./lib/lists.c:82
> #1  sl_alloc (s=0x21bd7d0) at lib/slab.c:264
> #2  0x000000000044f3e2 in rte_get_temp (a=0x22ffe18) at nest/rt-table.c:282
> #3  0x0000000000434220 in bgp_rte_update (s=s at entry=0x7ffe4c97c110, n=n at entry=0x7ffe4c97bf80, path_id=path_id at entry=0, a0=a0 at entry=0x7ffe4c97bfd0) at proto/bgp/packets.c:1326
> #4  0x00000000004359db in bgp_decode_nlri_ip4 (s=0x7ffe4c97c110, pos=<optimized out>, len=<optimized out>, a=0x7ffe4c97bfd0) at proto/bgp/packets.c:1479
> #5  0x0000000000432f2b in bgp_decode_nlri (s=s at entry=0x7ffe4c97c110, afi=afi at entry=65537, nlri=<optimized out>, len=<optimized out>, ea=ea at entry=0x22fa4c0, nh=<optimized out>, 
>     nh_len=<optimized out>) at proto/bgp/packets.c:2410
> #6  0x00000000004383a9 in bgp_rx_update (conn=conn at entry=0x2250728, pkt=pkt at entry=0x22c8a20 '\377' <repeats 16 times>, len=len at entry=227) at proto/bgp/packets.c:2505
> #7  0x0000000000438ef9 in bgp_rx_packet (conn=conn at entry=0x2250728, pkt=pkt at entry=0x22c8a20 '\377' <repeats 16 times>, len=227) at proto/bgp/packets.c:3097
> #8  0x0000000000438fb5 in bgp_rx (sk=0x22c88d0, size=<optimized out>) at proto/bgp/packets.c:3142
> #9  0x00000000004039a0 in call_rx_hook (s=s at entry=0x22c88d0, size=<optimized out>) at sysdep/unix/io.c:1796
> #10 0x000000000040a8dd in sk_read (s=s at entry=0x22c88d0, revents=1) at sysdep/unix/io.c:1884
> #11 0x000000000040b394 in io_loop () at sysdep/unix/io.c:2342
> #12 0x000000000040b760 in main (argc=<optimized out>, argv=<optimized out>) at sysdep/unix/main.c:923

Are these all false positives or should there be more cleanup code on
these list entries?


Sorry for the noise,

Andi


More information about the Bird-users mailing list