[PATCH] Fix Bird/Bird6 wrong LSA collisions detection

Mikkelsen, Asbjorn amikkelsen at verisign.com
Thu Nov 8 16:45:06 CET 2018


It looks like your patch should be able to resolve the issue as well, and is more succinct.

When we have the environment again to test this, we will verify if your proposed patch works as expected in our use case.

Thanks again for looking into this.

Asbjorn


On 06.11.18, 18:19, "Ondrej Zajicek" <santiago at crfreenet.org> wrote:

    On Tue, Nov 06, 2018 at 04:15:34PM +0100, Ondrej Zajicek wrote:
    > On Tue, Nov 06, 2018 at 08:53:35AM +0000, Mikkelsen, Asbjorn wrote:
    > > Bird/Bird6 can wrongly report LSA ID collisions and stop working. A route
    > > fib entry can in certain cases be released (in the belief that nobody is
    > > using it) while it is still referenced (but not dereferenced).
    > > 
    > > A PR can be found here:
    > > https://github.com/BIRD/bird/pull/3
    > 
    > Hi
    > 
    > Thanks for the patch, I have some questions:
    > 
    > 1) Do I understand the cause correctly: there is a struct top_hash_entry
    > and an associated struct ort entry, but the struct ort entry got removed
    > while the struct top_hash_entry still links to it, so when a new struct
    > ort for the given network is allocated, it is technically different from
    > the old one, and that causes the collision?
    
    One more question - in your scenario from PR:
    
        1. Bird OSPF is running fine
    
        2. Cut the network connectivity
    
        3. Then quickly enough, and before Bird realizes that the network is down, flush bird OSPF routes
    
        4. Bird will flush its routes but none of them will be removed because there are no OSPF routers to ACK this route flushing
    
        5. Wait long enough for Bird to realize that all neighbors are gone (~5 minutes)
    
        6. Bird will delete the corresponding route fibs, believing nobody is using them
    
        7. Re-add Bird OSPF routes and then Bird will try to read the corresponding deleted fib
    
        8. Bird OSPF does not work anymore and we see this in the log:
        <ERR> ospf1: LSA ID collision for X.X.X.X/32
    
    It seems to me that it may happen after routes are flushed and ort
    entries are removed (in rt_sync()), but before LSA entries are removed
    (by ospf_clear_lsa()). After ospf_clear_lsa(), en->lsa_body is NULL and
    therefore the issue will not happen.
    
    If I understand it correctly, that may happen after the flush but
    *before* all neighbors are gone (because after all neighbors are gone,
    LSA entries will be removed from ospf_update_lsadb() as if they were
    ACKed). That is slightly different from the scenario in the PR.
    
    I think that the proper fix is to reset en->nf field during
    ospf_flush_lsa(). Patch attached, could you try it?
    
    Your patch should also work, but there is a corner case. Route R1 is
    exported, then flushed (but not ACKed), then exported again shortly
    after the first export, which triggers the MinLSInterval check in
    ospf_do_originate_lsa(). Now we have an LSA entry that is MAXAGE but
    has a scheduled LSA in en->next_lsa_body. If a different route R2 with
    a real LSA ID collision is then exported, it should report a collision,
    but it is ignored by the condition in the patch.
    
    -- 
    Elen sila lumenn' omentielvo
    
    Ondrej 'Santiago' Zajicek (email: santiago at crfreenet.org)
    OpenPGP encrypted e-mails preferred (KeyID 0x11DEADC3, wwwkeys.pgp.net)
    "To err is human -- to blame it on a computer is even more so."
    
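For readers following the thread, the stale-link problem and the proposed
reset of en->nf can be modeled with a small toy program. This is only a
sketch under stated assumptions, not BIRD's code: the toy_* types and
functions are hypothetical stand-ins, and the collision check is a
simplified guess at the logic discussed above (an entry whose nf link
differs from the originating route entry, while lsa_body is still set, is
reported as a collision).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a route table entry (not BIRD's struct ort). */
struct toy_ort { int network; };

/* Hypothetical stand-in for an LSA database entry. */
struct toy_lsa_entry {
  struct toy_ort *nf;    /* link to the route entry that originated the LSA */
  const char *lsa_body;  /* non-NULL until the LSA is finally cleared */
};

/* Returns 1 if origination is accepted, 0 if it is treated as a collision. */
int toy_originate(struct toy_lsa_entry *en, struct toy_ort *nf)
{
  /* Adopt the new owner only if the entry has no owner or no body. */
  if (!en->nf || !en->lsa_body)
    en->nf = nf;

  /* A different owner on a live entry is treated as an LSA ID collision. */
  if (en->nf != nf)
    return 0;

  en->lsa_body = "body";
  return 1;
}

/* Flush: the body stays in place until the flush is ACKed.
 * reset_nf models the proposed fix of resetting the link on flush. */
void toy_flush(struct toy_lsa_entry *en, int reset_nf)
{
  if (reset_nf)
    en->nf = NULL;
}

/* Runs the sequence: originate, flush (not ACKed), route entry replaced
 * by a new object for the same network, originate again. */
int toy_scenario(int reset_nf)
{
  /* Two distinct objects for the same network: the old entry that gets
   * removed and the new one allocated later. */
  static struct toy_ort old_ort = { 1 }, new_ort = { 1 };
  struct toy_lsa_entry en = { NULL, NULL };

  int first = toy_originate(&en, &old_ort);
  assert(first == 1);
  (void) first;

  toy_flush(&en, reset_nf);

  /* en.lsa_body is still set here because nothing ACKed the flush. */
  return toy_originate(&en, &new_ort);
}
```

In this model, toy_scenario(0) ends in a false collision because the entry
still points at the old route object, while toy_scenario(1) accepts the
re-origination because the link was reset during the flush.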



