netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alex Sidorenko <alexandre.sidorenko@hp.com>
To: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Stale entries in RT_TABLE_LOCAL
Date: Wed, 23 Feb 2011 12:43:23 -0500	[thread overview]
Message-ID: <201102231243.23579.alexandre.sidorenko@hp.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 6094 bytes --]

Hello,

I have found several scenarios when after deleting IP-address from an 
interface there is a stale entry left in RT_TABLE_LOCAL.

All these scenarios use the fact that it is possible to add the same address 
multiple times to the same interface using different masks.

Let us do the following using dummy0 interface:

ifconfig dummy0 192.168.140.31 netmask 255.255.252.0
ip addr add 192.168.142.109/23 dev dummy0
ip addr add 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/23 dev dummy0

We add 192.168.142.109/23 and 192.168.142.109/22, then delete them (order is 
important). After that, 192.168.142.109 is not in 'ip addr ls' but there are 
entries using this addr in RT_TABLE_LOCAL.

An attached script demonstrates the problem:

{asid 14:00:57} sudo sh iptest.sh
Tables before the test
13: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 5e:1a:fa:44:90:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.140.31/22 brd 192.168.143.255 scope global dummy0
    inet6 fe80::5c1a:faff:fe44:90f6/64 scope link 
       valid_lft forever preferred_lft forever
 
local 192.168.140.31 dev dummy0  proto kernel  scope host  src 192.168.140.31 
broadcast 192.168.140.0 dev dummy0  proto kernel  scope link  src 
192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.140.31 
----------------------
Tables after the test
13: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 5e:1a:fa:44:90:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.140.31/22 brd 192.168.143.255 scope global dummy0
    inet6 fe80::5c1a:faff:fe44:90f6/64 scope link 
       valid_lft forever preferred_lft forever
 
local 192.168.140.31 dev dummy0  proto kernel  scope host  src 192.168.140.31 
local 192.168.142.109 dev dummy0  proto kernel  scope host  src 192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.142.109 


As you see, even though there is no 192.168.142.109 on dummy0 address list, 
the entries referring to this addr are still present in RT_TABLE_LOCAL.

Another scenario (adding/deleting two addresses, each one twice with different 
mask) can lead to stale entries cross-referencing each other, like

local 192.168.5.8  proto kernel  scope host  src 192.168.5.9 
local 192.168.5.9  proto kernel  scope host  src 192.168.5.8 

Analysis
--------

Both scenarios use the fact that we can add the same address multiple times to 
the same interface, using different masks.

1. When we delete an IP addr, we remove it from the interface addr list and 
send a notifier to routing code (fib_del_ifaddr) asking to delete the 
associated routes.

2. When we enter fib_del_ifaddr(struct in_ifaddr *ifa), the address is already
deleted. But if we add the same IP twice (with different masks), the same
address (even though with different prefix) is present two times. So after the
first deletion we still have its 2nd instance on the list.

3. We do the following in fib_del_ifaddr():

         for (ifa1 = in_dev->ifa_list; ifa1; ifa1 = ifa1->ifa_next) {
                 if (ifa->ifa_local == ifa1->ifa_local)
                         ok |= LOCAL_OK;
                 if (ifa->ifa_broadcast == ifa1->ifa_broadcast)
                         ok |= BRD_OK;
                 if (brd == ifa1->ifa_broadcast)
                         ok |= BRD1_OK;
                 if (any == ifa1->ifa_broadcast)
                         ok |= BRD0_OK;
         }

That is, we loop on all addrs of the interface (in_dev->ifa_list) and compare
address we have just deleted (passed in 'ifa') with addresses on the list. 

As we compare them without taking prefix (mask) into account, the following 
will be true:

ifa->ifa_local == ifa1->ifa_local
ifa->ifa_broadcast == ifa1->ifa_broadcast

4. As a result, after deleting the first instance of IP (192.168.142.109/22) 
we still have  192.168.142.109/23 on the list. The routing code will find that 
this addr (and broadcast) are still present on the list and will not delete 
the routes.

5. When we delete the second time (192.168.142.109/23), there will be no
192.168.142.109 on the list anymore and the routing code will delete the route 
- but only one out of two entries.

How this can be fixed
---------------------

I am not sure what is the best way to fix this, I can think of several 
approaches:

  (a) change the sources so that it would be impossible to add the same IP 
      multiple times, even with different masks. I cannot think of any
      situation where adding the same IP (but with different mask) to the same
      interface could be useful. But maybe I am wrong?

  (b) improve the deletion algorithm in fib_del_ifaddr()

  (c) add a periodic cleanup that will purge all entries from 'local' table if
      there are no corresponding IPs on the interface list


Impact
------

Stale entries in RT_TABLE_LOCAL make ARP reply to requests for that IPs, even 
though these IPs do not belong to any interface.

These scenarios might seem a bit pathological, but in reality they are 
possible on clusters with multiple addresses on several interfaces, where 
addresses are added/deleted for service migration. Address migration can be 
done both by software and by system administrators and if by mistake a wrong 
mask is used, we can get this situation.

And yes, one of HP customers met exactly this problem. They saw a 'duplicate 
IP' issue after migrating some services and found that the host replies to 
ARP-request even though 'ip addr ls' did not show this address. It is not 
common knowledge that ARP implementation uses RT_TABLE_LOCAL to decide whether 
IP is local, so they were unable to understand what is wrong.

Regards,
Alex

------------------------------------------------------------------
Alexandre Sidorenko             email: asid@hp.com
WTEC Linux                      Hewlett-Packard (Canada)
------------------------------------------------------------------

[-- Attachment #2: iptest.sh --]
[-- Type: application/x-shellscript, Size: 597 bytes --]

             reply	other threads:[~2011-02-23 17:43 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-23 17:43 Alex Sidorenko [this message]
2011-03-09 21:53 ` Stale entries in RT_TABLE_LOCAL David Miller
2011-03-10  0:04   ` Julian Anastasov
2011-03-10  1:50     ` Julian Anastasov
2011-03-10 19:41       ` Alex Sidorenko
2011-03-15  8:47         ` Julian Anastasov
2011-03-17 16:11           ` Jiri Bohac
2011-03-18 22:57             ` Julian Anastasov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201102231243.23579.alexandre.sidorenko@hp.com \
    --to=alexandre.sidorenko@hp.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).