netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hannes Frederic Sowa <hannes@redhat.com>
To: Neil Horman <nhorman@tuxdriver.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>,
	netdev@vger.kernel.org, davem@davemloft.net,
	ddutt@cumulusnetworks.com
Subject: Re: [PATCH net-next] net: change fib behavior based on interface link status
Date: Wed, 03 Jun 2015 16:45:50 +0200	[thread overview]
Message-ID: <1433342750.3163.14.camel@redhat.com> (raw)
In-Reply-To: <20150603142131.GC10203@neilslaptop.think-freely.org>

On Mi, 2015-06-03 at 10:21 -0400, Neil Horman wrote:
> On Wed, Jun 03, 2015 at 04:13:08PM +0200, Hannes Frederic Sowa wrote:
> > On Mi, 2015-06-03 at 09:53 -0400, Neil Horman wrote:
> > > On Tue, Jun 02, 2015 at 11:07:19PM -0400, Andy Gospodarek wrote:
> > > > This patch adds the ability to have the Linux kernel track whether or
> > > > not a particular route should be used based on the link-status of the
> > > > interface associated with the next-hop.
> > > > 
> > > > Before this patch any link-failure on an interface that was serving as a
> > > > gateway for some systems could result in those systems being isolated
> > > > from the rest of the network as the stack would continue to attempt to
> > > > send frames out of an interface that is actually linked-down.  When the
> > > > kernel is responsible for all forwarding, it should also be responsible
> > > > for taking action when the traffic can no longer be forwarded -- there
> > > > is no real need to outsource link-monitoring to userspace anymore.
> > > > 
> > > > This feature is only enabled with the new sysctl set (default is off):
> > > > net.core.kill_routes_on_linkdown = 1
> > > > 
> > > > When this is set, the following behavior can be observed (interface p8p1
> > > > is link-down):
> > > > 
> > > > # ip route show 
> > > > default via 10.0.5.2 dev p9p1 
> > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead 
> > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead 
> > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > # ip route get 90.0.0.1 
> > > > 90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1 
> > > >     cache 
> > > > # ip route get 80.0.0.1 
> > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > >     cache <local> 
> > > > # ip route get 80.0.0.2
> > > > 80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15 
> > > >     cache 
> > > > 
> > > > While the route does remain in the table (so it can be modified if
> > > > needed rather than being wiped away as it would be if IFF_UP was
> > > > cleared), the proper next-hop is chosen automatically when the link is
> > > > down.  Now interface p8p1 is linked-up:
> > > > 
> > > > # ip route show 
> > > > default via 10.0.5.2 dev p9p1 
> > > > 10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15 
> > > > 70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1 
> > > > 80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 
> > > > 90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 
> > > > 90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2 
> > > > 192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2 
> > > > # ip route get 90.0.0.1 
> > > > 90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > >     cache 
> > > > # ip route get 80.0.0.1 
> > > > local 80.0.0.1 dev lo  src 80.0.0.1 
> > > >     cache <local> 
> > > > # ip route get 80.0.0.2
> > > > 80.0.0.2 dev p8p1  src 80.0.0.1 
> > > >     cache 
> > > > 
> > > > and the output changes to what one would expect.
> > > > 
> > > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> > > > Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> > > > 
> > > > ---
> > > > Though there were some that preferred not to have a configuration option
> > > > and to make this behavior the default when it was discussed in Ottawa
> > > > earlier this year since "it was time to do this."  I wanted to propose
> > > > the config option to preserve the current behavior for those that desire
> > > > it.  I'll happily remove it if Dave and Linus approve.
> > > > 
> > > > An IPv6 implementation is also needed (DECnet too!), but I wanted to
> > > > start with the IPv4 implementation to get people comfortable with the
> > > > idea before moving forward.  If this is accepted the IPv6 implementation
> > > > can be posted shortly.  
> > > > 
> > > > FWIW, we have been running this patch with the sysctl setting above and
> > > > our customers have been happily using a backported version for IPv4 and
> > > > IPv6 for >6 months.
> > > > 
> > > >  include/linux/netdevice.h      |  1 +
> > > >  include/net/fib_rules.h        |  1 +
> > > >  include/net/ip_fib.h           |  1 +
> > > >  include/uapi/linux/rtnetlink.h |  1 +
> > > >  include/uapi/linux/sysctl.h    |  1 +
> > > >  kernel/sysctl_binary.c         |  1 +
> > > >  net/core/dev.c                 |  2 ++
> > > >  net/core/sysctl_net_core.c     |  7 +++++++
> > > >  net/ipv4/fib_frontend.c        | 12 +++++++++--
> > > >  net/ipv4/fib_rules.c           |  7 ++++++-
> > > >  net/ipv4/fib_semantics.c       | 46 ++++++++++++++++++++++++++++++++++++------
> > > >  net/ipv4/fib_trie.c            | 19 +++++++++++++----
> > > >  12 files changed, 86 insertions(+), 13 deletions(-)
> > > > 
> > > 
> > > Hey Andy-
> > > 	So, the implementation looks great.  That said, a question comes up in
> > > my mind that I can't quite resolve.  In looking at your code, and seeing how it
> > > fit into the routing path, I came accross the function sync_fib_down_dev, which
> > > appears to be called in response to NETDEV_DOWN events.  Its purpose from my
> > > read is to interrogate the fib for routes that specify the device going down as
> > > the egress device, and mark said routes as dead.  I'm guessing that I'm missing
> > > something here, but it seems as though this patch shouldn't be needed in light
> > > of that behavior (unless the existing code is somehow broken, or I'm confused
> > > about its purpose).  Could it be that whats really needed here is for Network
> > > Manager (or other user space programs ) to stop removing routes in response to
> > > NETDEV_DOWN events so that the above kernel code can do its job?
> > 
> > We don't care about DEAD flag during normal fib lookup, only when
> > considering multipath routes.
> > 
> 
> I get that, but I think the question still applies.  If you have multiple routes
> to the same destination, the sync_fib_down_dev code should still be doing the
> job that andy's patch does, shouldn't it?  I.e. if you have two interfaces with
> ip addresses on the same subnet, and a route for each interface, only the first
> route in the table typically gets used, because its found first.  But if that
> link goes down, the sync_fib_down_dev code should mark the links route as dead,
> and subsequent lookups should use the alternate interface.  But it seems like
> thats not happening, so I'm wondering why that is.

It is a matter of implementation, I think, albeit I would prefer to not
kill all connections during a small link loss.

Given a laptop with unstable link, we would immediately kill all
connections during packet send with this option enabled as error would
propagate to local_error handlers (I hope this is correct, Andy?).

If you have multiple routes to the same destination as in multipath,
yes, this should work and I think will work.
But as soon as you have to backtrack, this is not a semantically
equivalent route to the destination any more, so we should not switch to
it just because of a link event.

I don't think we can change the default easily here, but I like this new
feature.

Bye,
Hannes

  reply	other threads:[~2015-06-03 14:45 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-03  3:07 [PATCH net-next] net: change fib behavior based on interface link status Andy Gospodarek
2015-06-03  5:03 ` Scott Feldman
2015-06-03 15:12   ` Andy Gospodarek
2015-06-03 17:40     ` Scott Feldman
2015-06-03 17:57       ` Andy Gospodarek
2015-06-03 18:12         ` Scott Feldman
2015-06-03 18:16           ` Andy Gospodarek
2015-06-03  9:35 ` Hannes Frederic Sowa
2015-06-03 13:49   ` Andy Gospodarek
2015-06-03 13:53 ` Neil Horman
2015-06-03 14:13   ` Hannes Frederic Sowa
2015-06-03 14:21     ` Neil Horman
2015-06-03 14:45       ` Hannes Frederic Sowa [this message]
2015-06-03 14:46       ` Andy Gospodarek
2015-06-03 15:02         ` Neil Horman
2015-06-03 15:16           ` Andy Gospodarek
2015-06-03 17:47             ` Neil Horman
2015-06-03 15:17           ` Hannes Frederic Sowa
2015-06-03 14:03 ` Hannes Frederic Sowa
2015-06-03 14:24   ` Andy Gospodarek
2015-06-03 18:15 ` Scott Feldman
2015-06-03 18:27   ` Andy Gospodarek
2015-06-03 19:03     ` Scott Feldman
2015-06-03 20:11       ` Andy Gospodarek
2015-06-03 20:04     ` Hannes Frederic Sowa
2015-06-03 20:34       ` Andy Gospodarek
2015-06-03 20:36         ` Hannes Frederic Sowa
2015-06-03 18:25 ` Alexander Duyck
2015-06-03 20:02   ` Andy Gospodarek
2015-06-03 21:01     ` Alexander Duyck
2015-06-05 19:05       ` Andy Gospodarek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1433342750.3163.14.camel@redhat.com \
    --to=hannes@redhat.com \
    --cc=davem@davemloft.net \
    --cc=ddutt@cumulusnetworks.com \
    --cc=gospo@cumulusnetworks.com \
    --cc=netdev@vger.kernel.org \
    --cc=nhorman@tuxdriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).