Babel in Bird 1.6.0

Toke Høiland-Jørgensen toke at toke.dk
Sat Apr 30 15:15:52 CEST 2016


Baptiste Jonglez <baptiste at bitsofnetworks.org> writes:

> On Sat, Apr 30, 2016 at 01:20:14AM +0200, Toke Høiland-Jørgensen wrote:
>> On 29 April 2016 20:24:43 CEST, Ondrej Zajicek <santiago at crfreenet.org> wrote:
>> >I would like to thank Toke Høiland-Jørgensen for the Babel protocol
>> >implementation, which was finally merged despite my tardy code reviews.
>> 
>> Yay, awesome! You're very welcome, and congratulations on the release! :)
>
> Congrats, Toke!
>
> I have just tested your implementation (from a pure user perspective, I
> haven't looked at the code), and it looks great!

Thanks, and thanks for taking it for a spin! Some comments below.

> Thanks to you, I will have to switch my test overlay network to
> IPv6-only...

Sorry about that ;)

> === Configuration ===
>
> When the quotes are missing around the interface name, the error message
> is misleading.  This configuration:
>
>     protocol babel {
>         interface eth0;
>     }
>
> gives:
>
>     /etc/bird6.conf, line 240: IP address expected
>
> which is very strange, because an IP address does not make sense here
> (and something like "interface 2001:db8::42" obviously gives an
> error).

Not sure why this is. The inner workings of the configuration parsing
still has some a good portion of black magic...

> === Filtering ===
>
> When updating a filter that removes some routes (e.g. switching from
> "export all" to "export none"), the routes stay in the Babel table, even
> though they are no longer exported to the protocol:
>
> bird> show babel entries babel1
> babel1:
> Prefix                        Router ID               Metric Seqno Expires Sources
> 2001:db8:42::/64              <pending>                                          1
> bird> show route export babel1
> bird>

These are routes coming from the box itself (not from peers)? They
should expire after BABEL_HOLD_TIME (10 seconds)...

> === Hello interval range checking ===
>
> The hello interval cannot be set below 1 second: the parser seems to
> expect an integer.  The packet format encodes intervals in centiseconds,
> so it would make sense to allow any fractional Hello interval down to 0.01
> seconds.
>
> I just checked, babeld's parser allows to go as low as 0.001s (which is a
> bit scary in itself, since it generates about 1 kpps / 1.5 Mbps of control
> traffic).

Hmm, yes. However, since the internal Bird timers run at a granularity
of seconds only there's not much point in having the ability to
configure smaller values.

> On a related note, there is no overflow check on the hello interval.  For
> instance, setting the hello interval to 800 seconds produces this in
> tcpdump:
>
>     Hello seqno 1 interval 144.64s

Right, yeah, the overflow check is done on the specified values not
their centisecond equivalents. Oops.

> === Interaction between reconfiguration and timers ===
>
> Changing the hello interval and telling bird to reconfigure itself leads
> to a somewhat inconsistent behaviour:
>
> - when decreasing the hello interval, the old interval is still used for
>   the next Hello message, but the new interval is immediately taken into
>   account for IHU and Update messages.  I don't think this is a real
>   interoperability issue, but it's just strange to see a lot of Updates
>   and IHU without any Hello (when decreasing the hello interval
>   substantially, for instance from one minute to 3 seconds)
>
> Increasing the hello interval works as expected (the old interval is still
> used for the next Hello, IHU and Update messages).  This looks correct and
> consistent with the RFC, section 3.4.1: the interval change is actually
> taken into account when sending the next Hello message, because otherwise
> our neighbours could think we are dead.

Hmm, is this consistent? Reconfiguration simply replaces the struct with
the values in the internal data structures, so the only reason I can see
why you would get that behaviour is because there's a TLV that happens
to be queued at the time you reconfigure. You can verify this by looking
at the debug output when running with TRACE level debugging enabled;
that outputs messages when the TLVs are generated, not when they are
sent out.

I do seem to have forgotten to generate a new hello when reconfiguring
(RFC section 3.4.1: "Equivalently, a node SHOULD send an unscheduled
Hello immediately after increasing its Hello interval."). I do believe
just adding that should resolve the issues with inconsistent behaviour?

> === TLV parsing error ===
>
> When a neighbouring babeld starts or stop, Bird complains about an error
> parsing a TLV.
>
> When starting babeld:
>
> avril 30 12:23:38 lud bird6[17307]: babel1: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 18 - parse error
> avril 30 12:23:38 lud bird6[17307]: babel1: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 18 - parse error
>
> When stopping babeld:
>
> avril 30 12:23:59 lud bird6[17307]: babel1: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 4 - parse error
> avril 30 12:23:59 lud bird6[17307]: babel2: Bad TLV from fe80::e8db:78ff:fe05:8a64 via tap-fastd type 8 pos 4 - parse error
>
> Packet captures are attached (babeld-start-parse-error.pcap and
> babeld-stop-parse-error.pcap).  The local node running Bird is
> fe80::c07d:62ff:fea6:b8f4, all other speakers are babeld 1.5.0.
>
> Type 8 is the Update TLV, I think Bird complains about this particular
> message (here seen by tcpdump):
>
>     Update/id ::/0 metric 65535 seqno 21902 interval infinity

I remember running into this. What happens here is that babeld sends an
update without a preceding router_id TLV, with a wildcard address, but
flag 0x40 set (meaning "infer the router ID from the address").

While I'm not sure what the purpose of this is (a null update with a
null router ID with infinity metric and interval?) it *is* technically
in spec. I think the reason why Bird complains is that Ondrej's cleanup
of my (admittedly messy) packet parsing code inadvertently moved the
check for the 0x40 flag inside the case branch for AE_IP6.

If you turn on debugging you should get a "No router ID seen before
update" message which will confirm that this is indeed the issue.


Most of these issues are fairly trivial fixes. I'll produce a patch once
I'm done grokking Ondrej's code cleanup changes :)

-Toke
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 472 bytes
Desc: not available
URL: <http://trubka.network.cz/pipermail/bird-users/attachments/20160430/6cc35dd4/attachment.asc>


More information about the Bird-users mailing list