Route attributes overwritten during reconfiguration

Lennert Buytenhek buytenh at wantstofly.org
Fri Feb 17 08:06:55 CET 2017


Hello!


On Fri, Nov 11, 2016 at 12:54:28PM +0100, Ondrej Zajicek wrote:

> > I've run into a bug which manifests itself during reconfiguration.
> > Unfortunately, the bird config in question is extremely large and I
> > can't easily cut it down to something replicable. I'm running into
> > this particular issue in bird6, but have no reason to believe IPv4
> > would fair any better. All results are using bird 1.6.2.
> ...
> > in this brief moment, a window of opportunity arises for our thus far
> > silent transit to make an unwanted appearance:
> > 
> >       sudo birdc6 show route for 2001:db8::/48
> >       BIRD 1.6.2 ready.
> >       2001:db8::/48 dev lo [static_protocol 05:44:07] * (200)
> >                    via xxx on et3 [transit_3 05:44:07] (100)
> >                    via xxx on et1 [transit_1 03:44:07] (100)
> >                    via xxx on et2 [transit_2 03:44:07] (100)
> 
> I don't understand this. Even if the route from transit_3 appears during
> the window of opportunity, it should disappear as the static route is
> propagated again to transit_3 and transit_3 selected it (assuming that
> it is the reason why it was not propagated before).
> 
> > This, it turns out, is tragic, because somehow the extended
> > communities we apply to the route learnt from transit_3 clobber the
> > ones we apply in static_protocol.
> 
> How does this ext community clobbering is expressed? Could you document
> it by appropriate 'show route all'?
> 
> > I'm pretty sure something screws up the rta associated to the static
> > route, but I'm also confused as to why the static protocol seems to
> > get reloaded at all given there is no configuration change in the
> > first place.
> 
> That is true, it should not.

I work with João and I managed to cut our (huge) config down to a
minimal reproducer.  Our setup does something like this:

===
router id 1.2.3.4;

log stderr all;
debug protocols all;

protocol device {
	scan time 3600;
}

protocol direct {
}

function function1(string value) {
	print "function1 called with value ", value;
	bgp_ext_community.add((ro, 1, 1));
	if "abcd" = value then { bgp_ext_community.add((ro, 2, 2)); }
}

function function2(string value) {
	print "function2 called with value ", value;
	if "efgh" = value then { bgp_ext_community.add((ro, 3, 3)); }
}

protocol static {
	route 1234::/32 via "lo" {
		function1("abcd");
		function2("efgh");
	};
}
===

There seems to be a problem related to the memory management of
strings used in per static route implicit filter expressions when a
reconfiguration happens.  If I load bird 1.6.3 (built for IPv6) up
with this config file and then send it a couple of SIGHUPs, the
process crashes:

===
# ./bird -c b.conf -d
2017-02-17 08:49:46 <TRACE> device1: Initializing
2017-02-17 08:49:46 <TRACE> direct1: Initializing
2017-02-17 08:49:46 <TRACE> static1: Initializing
2017-02-17 08:49:46 <TRACE> device1: Starting
2017-02-17 08:49:46 <TRACE> device1: Scanning interfaces
2017-02-17 08:49:46 <TRACE> device1: Connected to table master
2017-02-17 08:49:46 <TRACE> device1: State changed to feed
2017-02-17 08:49:46 <TRACE> direct1: Starting
2017-02-17 08:49:46 <TRACE> direct1: Connected to table master
2017-02-17 08:49:46 <TRACE> direct1: State changed to feed
2017-02-17 08:49:46 <TRACE> static1: Starting
2017-02-17 08:49:46 <TRACE> static1: Connected to table master
2017-02-17 08:49:46 <TRACE> static1: State changed to feed
2017-02-17 08:49:46 <INFO> Started
2017-02-17 08:49:46 <TRACE> device1: State changed to up
2017-02-17 08:49:46 <TRACE> direct1 < interface lo goes up
2017-02-17 08:49:46 <TRACE> direct1 < primary address ::1/128 on interface lo added
2017-02-17 08:49:46 <TRACE> direct1 < interface eth0 goes up
2017-02-17 08:49:46 <TRACE> direct1 < primary address fe80::/64 on interface eth0 added
2017-02-17 08:49:46 <TRACE> direct1: State changed to up
2017-02-17 08:49:46 <TRACE> static1 < interface lo goes up
2017-02-17 08:49:46 <INFO> function1 called with value abcd
2017-02-17 08:49:46 <INFO> function2 called with value efgh
2017-02-17 08:49:46 <TRACE> static1 > added [best] 1234::/32 dev lo
2017-02-17 08:49:46 <TRACE> static1 < interface eth0 goes up
2017-02-17 08:49:46 <TRACE> static1: State changed to up
2017-02-17 08:49:51 <INFO> Reconfiguration requested by SIGHUP
2017-02-17 08:49:51 <INFO> Reconfiguring
2017-02-17 08:49:51 <TRACE> device1: Reconfigured
2017-02-17 08:49:51 <TRACE> direct1: Reconfigured
2017-02-17 08:49:51 <TRACE> static1: Reconfigured
2017-02-17 08:49:51 <INFO> Reconfigured
2017-02-17 08:49:52 <INFO> Reconfiguration requested by SIGHUP
2017-02-17 08:49:52 <INFO> Reconfiguring
2017-02-17 08:49:52 <TRACE> device1: Reconfigured
2017-02-17 08:49:52 <TRACE> direct1: Reconfigured
2017-02-17 08:49:52 <TRACE> static1: Reconfigured
2017-02-17 08:49:52 <INFO> Reconfigured
2017-02-17 08:49:52 <INFO> Reconfiguration requested by SIGHUP
2017-02-17 08:49:52 <INFO> Reconfiguring
2017-02-17 08:49:52 <TRACE> device1: Reconfigured
2017-02-17 08:49:52 <TRACE> direct1: Reconfigured
Segmentation fault
===

The crash goes away if I apply this patch (not the right fix, I'm sure):

===
diff --git a/proto/static/static.c b/proto/static/static.c
index 0c088cd..a56771f 100644
--- a/proto/static/static.c
+++ b/proto/static/static.c
@@ -498,7 +498,8 @@ static_same_dest(struct static_route *x, struct static_route *y)
 static inline int
 static_same_rte(struct static_route *x, struct static_route *y)
 {
-  return static_same_dest(x, y) && i_same(x->cmds, y->cmds);
+//  return static_same_dest(x, y) && i_same(x->cmds, y->cmds);
+  return 0;
 }
 
 
===

I think that the reconfiguration code finds the old and the new implicit
filter expression to be the same code, and therefore hangs onto the old
version, but then the underlying memory for some of the old objects gets
freed.  Or something like that.

Running unmodified bird 1.6.3 under valgrind with the config above and
then sending it a bunch of SIGHUPs produces plenty of crap on the second
reconfigure (suggesting that the first reconfigure screws things up):

===
[...]
2017-02-17 08:59:25 <TRACE> direct1: Reconfigured
==3404== Invalid read of size 2
==3404==    at 0x805A499: i_same (filter.c:1539)
==3404==    by 0x805A893: i_same (filter.c:1632)
==3404==    by 0x8085D4C: static_match.isra.13 (static.c:501)
==3404==    by 0x8085F95: static_reconfigure (static.c:557)
==3404==    by 0x8053FB4: protos_commit (proto.c:420)
==3404==    by 0x8095F42: config_do_commit (conf.c:271)
==3404==    by 0x8096172: config_commit (conf.c:361)
==3404==    by 0x809D5C3: async_config (main.c:238)
==3404==    by 0x8099B37: io_loop (io.c:2135)
==3404==    by 0x804A733: main (main.c:885)
==3404==  Address 0x4217b5e is 3,606 bytes inside a block of size 4,088 free'd
==3404==    at 0x4028F4F: free (vg_replace_malloc.c:446)
==3404==    by 0x809E3E9: lp_free (mempool.c:211)
==3404==    by 0x80A2020: pool_free (resource.c:81)
==3404==    by 0x80A20AF: rfree (resource.c:165)
==3404==    by 0x8095E30: config_do_commit (conf.c:250)
==3404==    by 0x8096172: config_commit (conf.c:361)
==3404==    by 0x809D5C3: async_config (main.c:238)
==3404==    by 0x8099B37: io_loop (io.c:2135)
==3404==    by 0x804A733: main (main.c:885)
==3404== 
==3404== Invalid read of size 2
==3404==    at 0x805A4A7: i_same (filter.c:1541)
==3404==    by 0x805A893: i_same (filter.c:1632)
==3404==    by 0x8085D4C: static_match.isra.13 (static.c:501)
==3404==    by 0x8085F95: static_reconfigure (static.c:557)
==3404==    by 0x8053FB4: protos_commit (proto.c:420)
==3404==    by 0x8095F42: config_do_commit (conf.c:271)
==3404==    by 0x8096172: config_commit (conf.c:361)
==3404==    by 0x809D5C3: async_config (main.c:238)
==3404==    by 0x8099B37: io_loop (io.c:2135)
==3404==    by 0x804A733: main (main.c:885)
==3404==  Address 0x4217b5c is 3,604 bytes inside a block of size 4,088 free'd
==3404==    at 0x4028F4F: free (vg_replace_malloc.c:446)
==3404==    by 0x809E3E9: lp_free (mempool.c:211)
==3404==    by 0x80A2020: pool_free (resource.c:81)
==3404==    by 0x80A20AF: rfree (resource.c:165)
==3404==    by 0x8095E30: config_do_commit (conf.c:250)
==3404==    by 0x8096172: config_commit (conf.c:361)
==3404==    by 0x809D5C3: async_config (main.c:238)
==3404==    by 0x8099B37: io_loop (io.c:2135)
==3404==    by 0x804A733: main (main.c:885)
[...]
===

Does this make any sense?  I can dig further but I was hoping that this
would be enough information for you to figure out what's going on and
how to best fix it. :-)

Thank you very much in advance!


Cheers,
Lennert


More information about the Bird-users mailing list