* [PATCH net-next 0/3 v2] changes to make ipv4 routing table aware of next-hop link status
@ 2015-06-10 6:47 Andy Gospodarek
2015-06-10 6:47 ` [PATCH net-next 1/3 v2] net: track link-status of ipv4 nexthops Andy Gospodarek
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Andy Gospodarek @ 2015-06-10 6:47 UTC (permalink / raw)
To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
Cc: Andy Gospodarek
This series adds the ability to have the Linux kernel track whether or
not a particular route should be used based on the link-status of the
interface associated with the next-hop.
Before this patch any link-failure on an interface that was serving as a
gateway for some systems could result in those systems being isolated
from the rest of the network as the stack would continue to attempt to
send frames out of an interface that is actually linked-down. When the
kernel is responsible for all forwarding, it should also be responsible
for taking action when the traffic can no longer be forwarded -- there
is no real need to outsource link-monitoring to userspace anymore.
This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.
net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...
When the above sysctls are enabled (set to a non-zero value), the kernel
will not only report to userspace that the link is down, but it will
also report to userspace that a route is dead. This will signal to
userspace that the route will not be selected.
With the new sysctls set, the following behavior can be observed
(interface p8p1 is link-down):
# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
# ip route get 90.0.0.1
90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo src 80.0.0.1
cache <local>
# ip route get 80.0.0.2
80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
cache
While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down. Now interface p8p1 is linked-up:
# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
# ip route get 90.0.0.1
90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo src 80.0.0.1
cache <local>
# ip route get 80.0.0.2
80.0.0.2 dev p8p1 src 80.0.0.1
cache
and the output changes to what one would expect.
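The fallback seen in the two 'ip route get 90.0.0.1' results above can be
modeled in a few lines of userspace C. This is only a sketch, not the
kernel's fib lookup: struct demo_route, the two demo tables and
pick_route() are hypothetical names, the flag values are the rtnetlink.h
ones, and prefix matching is left out entirely.

```c
#include <stddef.h>

#define RTNH_F_DEAD     1	/* existing uapi flag */
#define RTNH_F_LINKDOWN 16	/* flag added by this series */

/* Hypothetical, simplified view of one candidate route for 90.0.0.0/24. */
struct demo_route {
	const char *gw;
	const char *dev;
	unsigned int metric;
	unsigned int flags;	/* flags as shown by 'ip route show' */
};

/* p8p1 is link-down and the sysctl is set: "dead linkdown". */
static const struct demo_route demo_table_down[] = {
	{ "80.0.0.2", "p8p1", 1, RTNH_F_DEAD | RTNH_F_LINKDOWN },
	{ "70.0.0.2", "p7p1", 2, 0 },
};

/* p8p1 is linked-up again: no flags on either route. */
static const struct demo_route demo_table_up[] = {
	{ "80.0.0.2", "p8p1", 1, 0 },
	{ "70.0.0.2", "p7p1", 2, 0 },
};

/* Return the lowest-metric route that is not marked dead, or NULL. */
static const struct demo_route *pick_route(const struct demo_route *r, size_t n)
{
	const struct demo_route *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (r[i].flags & RTNH_F_DEAD)
			continue;	/* skipped, but still in the table */
		if (!best || r[i].metric < best->metric)
			best = &r[i];
	}
	return best;
}
```

With demo_table_down the metric 2 route via p7p1 is returned; with
demo_table_up the metric 1 route via p8p1 wins again, and neither table
entry had to be removed or re-added in between.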
If the global or interface sysctl is not set, the following output would be
expected when p8p1 is down:
# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
If the dead flag does not appear there should be no expectation that the
kernel would skip using this route due to link being down.
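How the two flags relate to the sysctl can be sketched as follows. This
is a userspace model of the behavior described in this series (a
link-down nexthop is reported as dead only when
ignore_routes_with_linkdown applies), not the kernel implementation;
reported_flags() and nexthop_skipped() are hypothetical helpers.

```c
#define RTNH_F_DEAD     1	/* existing uapi flag */
#define RTNH_F_LINKDOWN 16	/* flag added by this series */

/*
 * Hypothetical helper: flags a netlink listener would see for a nexthop,
 * given the internal nexthop flags and whether the sysctl is in effect
 * for the nexthop's interface.
 */
static unsigned int reported_flags(unsigned int nh_flags, int ignore_linkdown)
{
	if ((nh_flags & RTNH_F_LINKDOWN) && ignore_linkdown)
		nh_flags |= RTNH_F_DEAD;
	return nh_flags;
}

/* Hypothetical helper: non-zero if a lookup would pass over this nexthop. */
static int nexthop_skipped(unsigned int nh_flags, int ignore_linkdown)
{
	return (reported_flags(nh_flags, ignore_linkdown) & RTNH_F_DEAD) != 0;
}
```

A link-down nexthop with the sysctl set yields both flags ("dead
linkdown") and is skipped; with the sysctl clear only "linkdown" is
reported and the nexthop stays eligible.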
v2: Split kernel changes into 2 patches: first to add linkdown flag and
second to add new sysctl settings. Also took suggestion from Alex to
simplify code by only checking sysctl during fib lookup and suggestion
from Scott to add a per-interface sysctl. Added iproute2 patch to
recognize and print linkdown flag.
Some people preferred to have no configuration option and to make this
behavior the default when it was discussed in Ottawa earlier this year,
since "it was time to do this." I wanted to propose the config option
anyway to preserve the current behavior for those who desire it. I'll
happily remove it if Dave and Linus approve.
An IPv6 implementation is also needed (DECnet too!), but I wanted to start with
the IPv4 implementation to get people comfortable with the idea before moving
forward. If this is accepted the IPv6 implementation can be posted shortly.
There was also a request for switchdev support for this, but that will be
posted as a followup as switchdev does not currently handle dead
next-hops in a multi-path case and I felt that infra needed to be added
first.
FWIW, we have been running the original version of this series with a
global sysctl and our customers have been happily using a backported
version for IPv4 and IPv6 for >6 months.
Andy Gospodarek (3):
  net: track link-status of ipv4 nexthops
  net: ipv4 sysctl option to ignore routes when nexthop link is down
  iproute2: add support to print 'linkdown' nexthop flag

 include/linux/inetdevice.h        |  3 ++
 include/net/fib_rules.h           |  3 +-
 include/net/ip_fib.h              | 21 +++++-----
 include/uapi/linux/ip.h           |  1 +
 include/uapi/linux/rtnetlink.h    |  1 +
 include/uapi/linux/sysctl.h       |  1 +
 kernel/sysctl_binary.c            |  1 +
 net/ipv4/devinet.c                |  2 +
 net/ipv4/fib_frontend.c           | 32 ++++++++-------
 net/ipv4/fib_rules.c              |  5 ++-
 net/ipv4/fib_semantics.c          | 84 ++++++++++++++++++++++++++++++---------
 net/ipv4/fib_trie.c               |  7 ++++
 net/ipv4/netfilter/ipt_rpfilter.c |  2 +-
 net/ipv4/route.c                  | 10 ++---
 ip/iproute.c                      |  4 ++
 15 files changed, 125 insertions(+), 52 deletions(-)
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH net-next 1/3 v2] net: track link-status of ipv4 nexthops
2015-06-10 6:47 [PATCH net-next 0/3 v2] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
@ 2015-06-10 6:47 ` Andy Gospodarek
2015-06-10 15:57 ` Alexander Duyck
2015-06-10 6:47 ` [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
2015-06-10 6:47 ` [PATCH iproute2 3/3 v2] add support to print 'linkdown' nexthop flag Andy Gospodarek
2 siblings, 1 reply; 12+ messages in thread
From: Andy Gospodarek @ 2015-06-10 6:47 UTC (permalink / raw)
To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
Cc: Andy Gospodarek

Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
reachable via an interface where carrier is off. No action is taken,
but additional flags are passed to userspace to indicate carrier status.

This also includes a cleanup to fib_disable_ip to more clearly indicate
what event made the function call to replace the more cryptic force
option previously used.

v2: Split out kernel functionality into 2 patches, this patch simply sets and
clears new nexthop flag RTNH_F_LINKDOWN.

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>

---
 include/net/ip_fib.h           |  4 +--
 include/uapi/linux/rtnetlink.h |  1 +
 net/ipv4/fib_frontend.c        | 26 +++++++++++---------
 net/ipv4/fib_semantics.c       | 56 ++++++++++++++++++++++++++++++++----------
 4 files changed, 60 insertions(+), 27 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 54271ed..d1de1b7 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -305,9 +305,9 @@ void fib_flush_external(struct net *net);
 
 /* Exported by fib_semantics.c */
 int ip_fib_check_default(__be32 gw, struct net_device *dev);
-int fib_sync_down_dev(struct net_device *dev, int force);
+int fib_sync_down_dev(struct net_device *dev, int event);
 int fib_sync_down_addr(struct net *net, __be32 local);
-int fib_sync_up(struct net_device *dev);
+int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 void fib_select_multipath(struct fib_result *res);
 
 /* Exported by fib_trie.c */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 17fb02f..8dde432 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -338,6 +338,7 @@ struct rtnexthop {
 #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup */
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link */
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
+#define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
 
 /* Macros to handle hexthops */
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872494e..1e4c646 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1063,9 +1063,9 @@ static void nl_fib_lookup_exit(struct net *net)
 	net->ipv4.fibnl = NULL;
 }
 
-static void fib_disable_ip(struct net_device *dev, int force)
+static void fib_disable_ip(struct net_device *dev, int event)
 {
-	if (fib_sync_down_dev(dev, force))
+	if (fib_sync_down_dev(dev, event))
 		fib_flush(dev_net(dev));
 	rt_cache_flush(dev_net(dev));
 	arp_ifdown(dev);
@@ -1080,9 +1080,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 	switch (event) {
 	case NETDEV_UP:
 		fib_add_ifaddr(ifa);
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-		fib_sync_up(dev);
-#endif
+		fib_sync_up(dev, RTNH_F_DEAD);
 		atomic_inc(&net->ipv4.dev_addr_genid);
 		rt_cache_flush(dev_net(dev));
 		break;
@@ -1093,7 +1091,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 			/* Last address was deleted from this interface.
 			 * Disable IP.
 			 */
-			fib_disable_ip(dev, 1);
+			fib_disable_ip(dev, event);
 		} else {
 			rt_cache_flush(dev_net(dev));
 		}
@@ -1107,9 +1105,10 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct in_device *in_dev;
 	struct net *net = dev_net(dev);
+	unsigned flags;
 
 	if (event == NETDEV_UNREGISTER) {
-		fib_disable_ip(dev, 2);
+		fib_disable_ip(dev, event);
 		rt_flush_dev(dev);
 		return NOTIFY_DONE;
 	}
@@ -1123,17 +1122,20 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		for_ifa(in_dev) {
 			fib_add_ifaddr(ifa);
 		} endfor_ifa(in_dev);
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-		fib_sync_up(dev);
-#endif
+		fib_sync_up(dev, RTNH_F_DEAD);
 		atomic_inc(&net->ipv4.dev_addr_genid);
 		rt_cache_flush(net);
 		break;
 	case NETDEV_DOWN:
-		fib_disable_ip(dev, 0);
+		fib_disable_ip(dev, event);
 		break;
-	case NETDEV_CHANGEMTU:
 	case NETDEV_CHANGE:
+		flags = dev_get_flags(dev);
+		if (flags & (IFF_RUNNING|IFF_LOWER_UP))
+			fib_sync_up(dev, RTNH_F_LINKDOWN);
+		else
+			fib_sync_down_dev(dev, event);
+	case NETDEV_CHANGEMTU:
 		rt_cache_flush(net);
 		break;
 	}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 28ec3c1..776e029 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		    nh->nh_tclassid != onh->nh_tclassid ||
 #endif
-		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
+		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_LINKDOWN)))
 			return -1;
 		onh++;
 	} endfor_nexthops(fi);
@@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
 		    nfi->fib_type == fi->fib_type &&
 		    memcmp(nfi->fib_metrics, fi->fib_metrics,
 			   sizeof(u32) * RTAX_MAX) == 0 &&
-		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
+		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_LINKDOWN)) == 0 &&
 		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
 			return fi;
 	}
@@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 				return -ENODEV;
 			if (!(dev->flags & IFF_UP))
 				return -ENETDOWN;
+			if (!netif_carrier_ok(dev))
+				nh->nh_flags |= RTNH_F_LINKDOWN;
 			nh->nh_dev = dev;
 			dev_hold(dev);
 			nh->nh_scope = RT_SCOPE_LINK;
@@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		if (!dev)
 			goto out;
 		dev_hold(dev);
+		if (!netif_carrier_ok(dev))
+			nh->nh_flags |= RTNH_F_LINKDOWN;
 		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
 	} else {
 		struct in_device *in_dev;
@@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		nh->nh_dev = in_dev->dev;
 		dev_hold(nh->nh_dev);
 		nh->nh_scope = RT_SCOPE_HOST;
+		if (!netif_carrier_ok(nh->nh_dev))
+			nh->nh_flags |= RTNH_F_LINKDOWN;
 		err = 0;
 	}
 out:
@@ -920,11 +926,17 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		if (!nh->nh_dev)
 			goto failure;
 	} else {
+		int linkdown = 0;
 		change_nexthops(fi) {
 			err = fib_check_nh(cfg, fi, nexthop_nh);
 			if (err != 0)
 				goto failure;
+			if (nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
+				linkdown++;
 		} endfor_nexthops(fi)
+		if (linkdown == fi->fib_nhs) {
+			fi->fib_flags |= RTNH_F_LINKDOWN;
+		}
 	}
 
 	if (fi->fib_prefsrc) {
@@ -1103,7 +1115,7 @@ int fib_sync_down_addr(struct net *net, __be32 local)
 	return ret;
 }
 
-int fib_sync_down_dev(struct net_device *dev, int force)
+int fib_sync_down_dev(struct net_device *dev, int event)
 {
 	int ret = 0;
 	int scope = RT_SCOPE_NOWHERE;
@@ -1112,7 +1124,7 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 	struct hlist_head *head = &fib_info_devhash[hash];
 	struct fib_nh *nh;
 
-	if (force)
+	if (event == NETDEV_UNREGISTER)
 		scope = -1;
 
 	hlist_for_each_entry(nh, head, nh_hash) {
@@ -1129,7 +1141,15 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 				dead++;
 			else if (nexthop_nh->nh_dev == dev &&
 				 nexthop_nh->nh_scope != scope) {
-				nexthop_nh->nh_flags |= RTNH_F_DEAD;
+				switch (event) {
+				case NETDEV_DOWN:
+				case NETDEV_UNREGISTER:
+					nexthop_nh->nh_flags |= RTNH_F_DEAD;
+					/* fall through */
+				case NETDEV_CHANGE:
+					nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
+					break;
+				}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 				spin_lock_bh(&fib_multipath_lock);
 				fi->fib_power -= nexthop_nh->nh_power;
@@ -1139,14 +1159,22 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 				dead++;
 			}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-			if (force > 1 && nexthop_nh->nh_dev == dev) {
+			if (event == NETDEV_UNREGISTER && nexthop_nh->nh_dev == dev) {
 				dead = fi->fib_nhs;
 				break;
 			}
 #endif
 		} endfor_nexthops(fi)
 		if (dead == fi->fib_nhs) {
-			fi->fib_flags |= RTNH_F_DEAD;
+			switch (event) {
+			case NETDEV_DOWN:
+			case NETDEV_UNREGISTER:
+				fi->fib_flags |= RTNH_F_DEAD;
+				/* fall through */
+			case NETDEV_CHANGE:
+				fi->fib_flags |= RTNH_F_LINKDOWN;
+				break;
+			}
 			ret++;
 		}
 	}
@@ -1210,13 +1238,11 @@ out:
 	return;
 }
 
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-
 /*
  * Dead device goes up. We wake up dead nexthops.
  * It takes sense only on multipath routes.
  */
-int fib_sync_up(struct net_device *dev)
+int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 {
 	struct fib_info *prev_fi;
 	unsigned int hash;
@@ -1243,7 +1269,7 @@ int fib_sync_up(struct net_device *dev)
 		prev_fi = fi;
 		alive = 0;
 		change_nexthops(fi) {
-			if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
+			if (!(nexthop_nh->nh_flags & nh_flags)) {
 				alive++;
 				continue;
 			}
@@ -1254,14 +1280,16 @@ int fib_sync_up(struct net_device *dev)
 			    !__in_dev_get_rtnl(dev))
 				continue;
 			alive++;
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
 			spin_lock_bh(&fib_multipath_lock);
 			nexthop_nh->nh_power = 0;
-			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
+			nexthop_nh->nh_flags &= ~nh_flags;
 			spin_unlock_bh(&fib_multipath_lock);
+#endif
 		} endfor_nexthops(fi)
 
 		if (alive > 0) {
-			fi->fib_flags &= ~RTNH_F_DEAD;
+			fi->fib_flags &= ~nh_flags;
 			ret++;
 		}
 	}
@@ -1269,6 +1297,8 @@ int fib_sync_up(struct net_device *dev)
 	return ret;
 }
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+
 /*
  * The algorithm is suboptimal, but it provides really
  * fair weighted route distribution.
-- 
1.9.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
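The event-to-flag mapping that the patch above adds to fib_sync_down_dev(),
and the selective clearing done through the new nh_flags argument of
fib_sync_up(), can be tried out in isolation with a small userspace
sketch. Nothing here is kernel code: enum demo_event stands in for the
NETDEV_* notifier events, and flags_for_down_event() and
flags_after_sync_up() are hypothetical helpers that only mirror the
switch statement and the mask operation in the diff.

```c
#define RTNH_F_DEAD     1	/* existing uapi flag */
#define RTNH_F_LINKDOWN 16	/* flag added by this patch */

/* Stand-ins for the notifier events; the numeric values are arbitrary. */
enum demo_event {
	DEMO_NETDEV_UP,
	DEMO_NETDEV_DOWN,
	DEMO_NETDEV_CHANGE,
	DEMO_NETDEV_UNREGISTER,
};

/* Mirror of the switch in the patch: which flags does an event add? */
static unsigned int flags_for_down_event(enum demo_event event,
					 unsigned int nh_flags)
{
	switch (event) {
	case DEMO_NETDEV_DOWN:
	case DEMO_NETDEV_UNREGISTER:
		nh_flags |= RTNH_F_DEAD;
		/* fall through */
	case DEMO_NETDEV_CHANGE:
		nh_flags |= RTNH_F_LINKDOWN;
		break;
	default:
		break;
	}
	return nh_flags;
}

/* Mirror of "nh_flags &= ~nh_flags_to_clear": only the named flag goes. */
static unsigned int flags_after_sync_up(unsigned int nh_flags,
					unsigned int nh_flags_to_clear)
{
	return nh_flags & ~nh_flags_to_clear;
}
```

A carrier change therefore sets only RTNH_F_LINKDOWN, while an
administrative down or unregister sets both flags; a later carrier-up
event clears RTNH_F_LINKDOWN and leaves RTNH_F_DEAD untouched.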
* Re: [PATCH net-next 1/3 v2] net: track link-status of ipv4 nexthops
2015-06-10 6:47 ` [PATCH net-next 1/3 v2] net: track link-status of ipv4 nexthops Andy Gospodarek
@ 2015-06-10 15:57 ` Alexander Duyck
2015-06-10 17:44 ` Andy Gospodarek
0 siblings, 1 reply; 12+ messages in thread
From: Alexander Duyck @ 2015-06-10 15:57 UTC (permalink / raw)
To: Andy Gospodarek, netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen

On 06/09/2015 11:47 PM, Andy Gospodarek wrote:
> Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
> reachable via an interface where carrier is off. No action is taken,
> but additional flags are passed to userspace to indicate carrier status.
>
> This also includes a cleanup to fib_disable_ip to more clearly indicate
> what event made the function call to replace the more cryptic force
> option previously used.
>
> v2: Split out kernel functionality into 2 patches, this patch simply sets and
> clears new nexthop flag RTNH_F_LINKDOWN.
>
> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
>
> ---
>  include/net/ip_fib.h           |  4 +--
>  include/uapi/linux/rtnetlink.h |  1 +
>  net/ipv4/fib_frontend.c        | 26 +++++++++++---------
>  net/ipv4/fib_semantics.c       | 56 ++++++++++++++++++++++++++++++++----------
>  4 files changed, 60 insertions(+), 27 deletions(-)
>
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 54271ed..d1de1b7 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -305,9 +305,9 @@ void fib_flush_external(struct net *net);
>
>  /* Exported by fib_semantics.c */
>  int ip_fib_check_default(__be32 gw, struct net_device *dev);
> -int fib_sync_down_dev(struct net_device *dev, int force);
> +int fib_sync_down_dev(struct net_device *dev, int event);
>  int fib_sync_down_addr(struct net *net, __be32 local);
> -int fib_sync_up(struct net_device *dev);
> +int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
>  void fib_select_multipath(struct fib_result *res);
>
>  /* Exported by fib_trie.c */
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 17fb02f..8dde432 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -338,6 +338,7 @@ struct rtnexthop {
>  #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup */
>  #define RTNH_F_ONLINK		4	/* Gateway is forced on link */
>  #define RTNH_F_OFFLOAD		8	/* offloaded route */
> +#define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */

So you could probably use some sort of define here to identify which flags
are event based and which are configuration based. Then it makes it easier
to take care of code below such as the nh_comp call.

>  /* Macros to handle hexthops */
>
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 872494e..1e4c646 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -1063,9 +1063,9 @@ static void nl_fib_lookup_exit(struct net *net)
>  	net->ipv4.fibnl = NULL;
>  }
>
> -static void fib_disable_ip(struct net_device *dev, int force)
> +static void fib_disable_ip(struct net_device *dev, int event)

Event should be an unsigned long to match fib_inetaddr_event and avoid any
unnecessary casts or warnings.

>  {
> -	if (fib_sync_down_dev(dev, force))
> +	if (fib_sync_down_dev(dev, event))
>  		fib_flush(dev_net(dev));
>  	rt_cache_flush(dev_net(dev));
>  	arp_ifdown(dev);
> @@ -1080,9 +1080,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
>  	switch (event) {
>  	case NETDEV_UP:
>  		fib_add_ifaddr(ifa);
> -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> -		fib_sync_up(dev);
> -#endif
> +		fib_sync_up(dev, RTNH_F_DEAD);
>  		atomic_inc(&net->ipv4.dev_addr_genid);
>  		rt_cache_flush(dev_net(dev));
>  		break;

Shouldn't this bit be left wrapped in CONFIG_IP_ROUTE_MULTIPATH? I thought
RTNH_F_DEAD was only used in that case.

> @@ -1093,7 +1091,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
>  			/* Last address was deleted from this interface.
>  			 * Disable IP.
>  			 */
> -			fib_disable_ip(dev, 1);
> +			fib_disable_ip(dev, event);
>  		} else {
>  			rt_cache_flush(dev_net(dev));
>  		}

Aren't you losing information here? The line above this change is a call to
see if ifa_list is NULL. I don't see how that data is being communicated
down to fib_disable_ip. It seems like you could end up with the wrong
scope.

> @@ -1107,9 +1105,10 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>  	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>  	struct in_device *in_dev;
>  	struct net *net = dev_net(dev);
> +	unsigned flags;
>
>  	if (event == NETDEV_UNREGISTER) {
> -		fib_disable_ip(dev, 2);
> +		fib_disable_ip(dev, event);
>  		rt_flush_dev(dev);
>  		return NOTIFY_DONE;
>  	}
> @@ -1123,17 +1122,20 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
>  		for_ifa(in_dev) {
>  			fib_add_ifaddr(ifa);
>  		} endfor_ifa(in_dev);
> -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> -		fib_sync_up(dev);
> -#endif
> +		fib_sync_up(dev, RTNH_F_DEAD);
>  		atomic_inc(&net->ipv4.dev_addr_genid);
>  		rt_cache_flush(net);
>  		break;

This seems like it is probably a behavior change. You should probably
leave this wrapped in the ifdef.

>  	case NETDEV_DOWN:
> -		fib_disable_ip(dev, 0);
> +		fib_disable_ip(dev, event);
>  		break;
> -	case NETDEV_CHANGEMTU:
>  	case NETDEV_CHANGE:
> +		flags = dev_get_flags(dev);
> +		if (flags & (IFF_RUNNING|IFF_LOWER_UP))
> +			fib_sync_up(dev, RTNH_F_LINKDOWN);
> +		else
> +			fib_sync_down_dev(dev, event);
> +	case NETDEV_CHANGEMTU:
>  		rt_cache_flush(net);
>  		break;
>  	}
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 28ec3c1..776e029 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
>  #ifdef CONFIG_IP_ROUTE_CLASSID
>  		    nh->nh_tclassid != onh->nh_tclassid ||
>  #endif
> -		    ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD))
> +		    ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_LINKDOWN)))
>  			return -1;
>  		onh++;
>  	} endfor_nexthops(fi);
> @@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
>  		    nfi->fib_type == fi->fib_type &&
>  		    memcmp(nfi->fib_metrics, fi->fib_metrics,
>  			   sizeof(u32) * RTAX_MAX) == 0 &&
> -		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
> +		    ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_LINKDOWN)) == 0 &&
>  		    (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0))
>  			return fi;
>  	}

Merging the two flags into some sort of define would probably help the
readability here.

> @@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  				return -ENODEV;
>  			if (!(dev->flags & IFF_UP))
>  				return -ENETDOWN;
> +			if (!netif_carrier_ok(dev))
> +				nh->nh_flags |= RTNH_F_LINKDOWN;
>  			nh->nh_dev = dev;
>  			dev_hold(dev);
>  			nh->nh_scope = RT_SCOPE_LINK;
> @@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  		if (!dev)
>  			goto out;
>  		dev_hold(dev);
> +		if (!netif_carrier_ok(dev))
> +			nh->nh_flags |= RTNH_F_LINKDOWN;
>  		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
>  	} else {
>  		struct in_device *in_dev;
> @@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  		nh->nh_dev = in_dev->dev;
>  		dev_hold(nh->nh_dev);
>  		nh->nh_scope = RT_SCOPE_HOST;
> +		if (!netif_carrier_ok(nh->nh_dev))
> +			nh->nh_flags |= RTNH_F_LINKDOWN;
>  		err = 0;
>  	}
>  out:
> @@ -920,11 +926,17 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>  		if (!nh->nh_dev)
>  			goto failure;
>  	} else {
> +		int linkdown = 0;
>  		change_nexthops(fi) {
>  			err = fib_check_nh(cfg, fi, nexthop_nh);
>  			if (err != 0)
>  				goto failure;
> +			if (nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
> +				linkdown++;
>  		} endfor_nexthops(fi)
> +		if (linkdown == fi->fib_nhs) {
> +			fi->fib_flags |= RTNH_F_LINKDOWN;
> +		}
>  	}
>
>  	if (fi->fib_prefsrc) {
> @@ -1103,7 +1115,7 @@ int fib_sync_down_addr(struct net *net, __be32 local)
>  	return ret;
>  }
>
> -int fib_sync_down_dev(struct net_device *dev, int force)
> +int fib_sync_down_dev(struct net_device *dev, int event)

I believe event should be unsigned long to match the original argument from
fib_inetaddr_event.

>  {
>  	int ret = 0;
>  	int scope = RT_SCOPE_NOWHERE;
> @@ -1112,7 +1124,7 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>  	struct hlist_head *head = &fib_info_devhash[hash];
>  	struct fib_nh *nh;
>
> -	if (force)
> +	if (event == NETDEV_UNREGISTER)
>  		scope = -1;
>

So I believe there is still a gap here in relation to fib_inetaddr_event.
Specifically in the case of that function it is supposed to set the force
value to 1 which would trigger this bit of code, but that isn't occurring
with your change.

>  	hlist_for_each_entry(nh, head, nh_hash) {
> @@ -1129,7 +1141,15 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>  				dead++;
>  			else if (nexthop_nh->nh_dev == dev &&
>  				 nexthop_nh->nh_scope != scope) {
> -				nexthop_nh->nh_flags |= RTNH_F_DEAD;
> +				switch (event) {
> +				case NETDEV_DOWN:
> +				case NETDEV_UNREGISTER:
> +					nexthop_nh->nh_flags |= RTNH_F_DEAD;
> +					/* fall through */
> +				case NETDEV_CHANGE:
> +					nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
> +					break;
> +				}
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>  				spin_lock_bh(&fib_multipath_lock);
>  				fi->fib_power -= nexthop_nh->nh_power;
> @@ -1139,14 +1159,22 @@ int fib_sync_down_dev(struct net_device *dev, int force)
>  				dead++;
>  			}
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -			if (force > 1 && nexthop_nh->nh_dev == dev) {
> +			if (event == NETDEV_UNREGISTER && nexthop_nh->nh_dev == dev) {
>  				dead = fi->fib_nhs;
>  				break;
>  			}
>  #endif
>  		} endfor_nexthops(fi)
>  		if (dead == fi->fib_nhs) {
> -			fi->fib_flags |= RTNH_F_DEAD;
> +			switch (event) {
> +			case NETDEV_DOWN:
> +			case NETDEV_UNREGISTER:
> +				fi->fib_flags |= RTNH_F_DEAD;
> +				/* fall through */
> +			case NETDEV_CHANGE:
> +				fi->fib_flags |= RTNH_F_LINKDOWN;
> +				break;
> +			}
>  			ret++;
>  		}
>  	}
> @@ -1210,13 +1238,11 @@ out:
>  	return;
>  }
>
> -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> -
>  /*
>   * Dead device goes up. We wake up dead nexthops.
>   * It takes sense only on multipath routes.
>   */
> -int fib_sync_up(struct net_device *dev)
> +int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>  {
>  	struct fib_info *prev_fi;
>  	unsigned int hash;
> @@ -1243,7 +1269,7 @@ int fib_sync_up(struct net_device *dev)
>  		prev_fi = fi;
>  		alive = 0;
>  		change_nexthops(fi) {
> -			if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
> +			if (!(nexthop_nh->nh_flags & nh_flags)) {
>  				alive++;
>  				continue;
>  			}
> @@ -1254,14 +1280,16 @@ int fib_sync_up(struct net_device *dev)
>  			    !__in_dev_get_rtnl(dev))
>  				continue;
>  			alive++;
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
>  			spin_lock_bh(&fib_multipath_lock);
>  			nexthop_nh->nh_power = 0;
> -			nexthop_nh->nh_flags &= ~RTNH_F_DEAD;
> +			nexthop_nh->nh_flags &= ~nh_flags;
>  			spin_unlock_bh(&fib_multipath_lock);
> +#endif
>  		} endfor_nexthops(fi)
>
>  		if (alive > 0) {
> -			fi->fib_flags &= ~RTNH_F_DEAD;
> +			fi->fib_flags &= ~nh_flags;
>  			ret++;
>  		}
>  	}
> @@ -1269,6 +1297,8 @@ int fib_sync_up(struct net_device *dev)
>  	return ret;
>  }
>
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +
>  /*
>   * The algorithm is suboptimal, but it provides really
>   * fair weighted route distribution.
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3 v2] net: track link-status of ipv4 nexthops 2015-06-10 15:57 ` Alexander Duyck @ 2015-06-10 17:44 ` Andy Gospodarek 0 siblings, 0 replies; 12+ messages in thread From: Andy Gospodarek @ 2015-06-10 17:44 UTC (permalink / raw) To: Alexander Duyck Cc: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen On Wed, Jun 10, 2015 at 08:57:55AM -0700, Alexander Duyck wrote: > On 06/09/2015 11:47 PM, Andy Gospodarek wrote: > >Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are > >reachable via an interface where carrier is off. No action is taken, > >but additional flags are passed to userspace to indicate carrier status. > > > >This also includes a cleanup to fib_disable_ip to more clearly indicate > >what event made the function call to replace the more cryptic force > >option previously used. > > > >v2: Split out kernel functionality into 2 patches, this patch simply sets and > >clears new nexthop flag RTNH_F_LINKDOWN. > > > >Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> > >Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> > > > >--- > > include/net/ip_fib.h | 4 +-- > > include/uapi/linux/rtnetlink.h | 1 + > > net/ipv4/fib_frontend.c | 26 +++++++++++--------- > > net/ipv4/fib_semantics.c | 56 ++++++++++++++++++++++++++++++++---------- > > 4 files changed, 60 insertions(+), 27 deletions(-) > > > >diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h > >index 54271ed..d1de1b7 100644 > >--- a/include/net/ip_fib.h > >+++ b/include/net/ip_fib.h > >@@ -305,9 +305,9 @@ void fib_flush_external(struct net *net); > > > > /* Exported by fib_semantics.c */ > > int ip_fib_check_default(__be32 gw, struct net_device *dev); > >-int fib_sync_down_dev(struct net_device *dev, int force); > >+int fib_sync_down_dev(struct net_device *dev, int event); > > int fib_sync_down_addr(struct net *net, __be32 local); > >-int fib_sync_up(struct net_device *dev); > >+int fib_sync_up(struct net_device *dev, unsigned int nh_flags); > 
> void fib_select_multipath(struct fib_result *res); > > > > /* Exported by fib_trie.c */ > >diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h > >index 17fb02f..8dde432 100644 > >--- a/include/uapi/linux/rtnetlink.h > >+++ b/include/uapi/linux/rtnetlink.h > >@@ -338,6 +338,7 @@ struct rtnexthop { > > #define RTNH_F_PERVASIVE 2 /* Do recursive gateway lookup */ > > #define RTNH_F_ONLINK 4 /* Gateway is forced on link */ > > #define RTNH_F_OFFLOAD 8 /* offloaded route */ > >+#define RTNH_F_LINKDOWN 16 /* carrier-down on nexthop */ > > So you could probably use some sort of define here to identify which flags > are event based and which are configuration based. Then it makes it easier > to take care of code below such as the nh_comp call. So are you saying something at the top to that would reserve a few bits for whether the kernel can set it, userspace can set it, or both could set it? Seems like overkill to me and a waste of bits -- though maybe there will not be that many nexthop flags. :) > > > /* Macros to handle hexthops */ > > > >diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c > >index 872494e..1e4c646 100644 > >--- a/net/ipv4/fib_frontend.c > >+++ b/net/ipv4/fib_frontend.c > >@@ -1063,9 +1063,9 @@ static void nl_fib_lookup_exit(struct net *net) > > net->ipv4.fibnl = NULL; > > } > > > >-static void fib_disable_ip(struct net_device *dev, int force) > >+static void fib_disable_ip(struct net_device *dev, int event) > > Event should be an unsigned long to match fib_inetaddr_event and avoid any > unnecessary casts or warnings. 
Fixed in upcoming v3 > > > { > >- if (fib_sync_down_dev(dev, force)) > >+ if (fib_sync_down_dev(dev, event)) > > fib_flush(dev_net(dev)); > > rt_cache_flush(dev_net(dev)); > > arp_ifdown(dev); > >@@ -1080,9 +1080,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event, > > switch (event) { > > case NETDEV_UP: > > fib_add_ifaddr(ifa); > >-#ifdef CONFIG_IP_ROUTE_MULTIPATH > >- fib_sync_up(dev); > >-#endif > >+ fib_sync_up(dev, RTNH_F_DEAD); > > atomic_inc(&net->ipv4.dev_addr_genid); > > rt_cache_flush(dev_net(dev)); > > break; > > Shouldn't this bit be left wrapped in CONFIG_IP_ROUTE_MULTIPATH? I thought > RTNH_F_DEAD was only used in that case. I can double-check this one and the one referenced below in fib_netdev_event, but I really struggle to understand why one would not want to be sure that when IFF_UP is set the DEAD flags were definitely going to be cleared before continuing? > > >@@ -1093,7 +1091,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event, > > /* Last address was deleted from this interface. > > * Disable IP. > > */ > >- fib_disable_ip(dev, 1); > >+ fib_disable_ip(dev, event); > > } else { > > rt_cache_flush(dev_net(dev)); > > } > > Aren't you losing information here? The line above this change is a call to > see if ifa_list is NULL. I don't see how that data is being communicated > down to fib_disable_ip. It seems like you could end up with the wrong > scope. Fixed in fib_sync_down_dev in upcoming v3. [...] 
> >diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > >index 28ec3c1..776e029 100644 > >--- a/net/ipv4/fib_semantics.c > >+++ b/net/ipv4/fib_semantics.c > >@@ -266,7 +266,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi) > > #ifdef CONFIG_IP_ROUTE_CLASSID > > nh->nh_tclassid != onh->nh_tclassid || > > #endif > >- ((nh->nh_flags ^ onh->nh_flags) & ~RTNH_F_DEAD)) > >+ ((nh->nh_flags ^ onh->nh_flags) & ~(RTNH_F_DEAD|RTNH_F_LINKDOWN))) > > return -1; > > onh++; > > } endfor_nexthops(fi); > >@@ -318,7 +318,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi) > > nfi->fib_type == fi->fib_type && > > memcmp(nfi->fib_metrics, fi->fib_metrics, > > sizeof(u32) * RTAX_MAX) == 0 && > >- ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 && > >+ ((nfi->fib_flags ^ fi->fib_flags) & ~(RTNH_F_DEAD|RTNH_F_LINKDOWN)) == 0 && > > (nfi->fib_nhs == 0 || nh_comp(fi, nfi) == 0)) > > return fi; > > } > > Merging the two flags into some sort of define would probably help the > readability here. I can create something like RTNH_F_COMP_MASK for upcoming v3. [...] > > >@@ -604,6 +604,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi, > > return -ENODEV; > > if (!(dev->flags & IFF_UP)) > > return -ENETDOWN; > >+ if (!netif_carrier_ok(dev)) > >+ nh->nh_flags |= RTNH_F_LINKDOWN; > > nh->nh_dev = dev; > > dev_hold(dev); > > nh->nh_scope = RT_SCOPE_LINK; > >@@ -636,6 +638,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi, > > if (!dev) > > goto out; > > dev_hold(dev); > >+ if (!netif_carrier_ok(dev)) > >+ nh->nh_flags |= RTNH_F_LINKDOWN; > > err = (dev->flags & IFF_UP) ? 
0 : -ENETDOWN; > > } else { > > struct in_device *in_dev; > >@@ -654,6 +658,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi, > > nh->nh_dev = in_dev->dev; > > dev_hold(nh->nh_dev); > > nh->nh_scope = RT_SCOPE_HOST; > >+ if (!netif_carrier_ok(nh->nh_dev)) > >+ nh->nh_flags |= RTNH_F_LINKDOWN; > > err = 0; > > } > > out: > >@@ -920,11 +926,17 @@ struct fib_info *fib_create_info(struct fib_config *cfg) > > if (!nh->nh_dev) > > goto failure; > > } else { > >+ int linkdown = 0; > > change_nexthops(fi) { > > err = fib_check_nh(cfg, fi, nexthop_nh); > > if (err != 0) > > goto failure; > >+ if (nexthop_nh->nh_flags & RTNH_F_LINKDOWN) > >+ linkdown++; > > } endfor_nexthops(fi) > >+ if (linkdown == fi->fib_nhs) { > >+ fi->fib_flags |= RTNH_F_LINKDOWN; > >+ } > > } > > > > if (fi->fib_prefsrc) { > >@@ -1103,7 +1115,7 @@ int fib_sync_down_addr(struct net *net, __be32 local) > > return ret; > > } > > > >-int fib_sync_down_dev(struct net_device *dev, int force) > >+int fib_sync_down_dev(struct net_device *dev, int event) > > I believe event should be unsigned long to match the original argument from > fib_inetaddr_event. Agreed, in upcoming v3. > > { > > int ret = 0; > > int scope = RT_SCOPE_NOWHERE; > >@@ -1112,7 +1124,7 @@ int fib_sync_down_dev(struct net_device *dev, int force) > > struct hlist_head *head = &fib_info_devhash[hash]; > > struct fib_nh *nh; > > > >- if (force) > >+ if (event == NETDEV_UNREGISTER) > > scope = -1; > > > > So I believe there is still a gap here in relation to fib_inetaddr_event. > Specifically in the case of that function it is supposed to set the force > value to 1 which would trigger this bit of code, but that isn't occurring > with your change. Agreed. As mentioned above, I fixed this in my tree and it will be in upcoming v3. Thanks for the review, Alex! ^ permalink raw reply [flat|nested] 12+ messages in thread
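For readers following the RTNH_F_COMP_MASK idea from the reply above, here is a minimal userspace sketch of what such a mask could look like. The macro name is only the one floated in this thread, the helper nh_flags_equal() is hypothetical, and the numeric value used for RTNH_F_LINKDOWN is an assumption made for illustration. None of this is taken from v3.

```c
#include <stdbool.h>

#define RTNH_F_DEAD      1	/* nexthop is dead */
#define RTNH_F_ONLINK    4	/* gateway is forced on link */
#define RTNH_F_LINKDOWN  16	/* assumed value, for illustration only */

/* Hypothetical name from the discussion: state bits that should not
 * make two otherwise identical nexthops compare as different. */
#define RTNH_F_COMP_MASK (RTNH_F_DEAD | RTNH_F_LINKDOWN)

/* Returns true when the two flag words differ at most in the state
 * bits covered by RTNH_F_COMP_MASK. */
bool nh_flags_equal(unsigned int a, unsigned int b)
{
	return ((a ^ b) & ~RTNH_F_COMP_MASK) == 0;
}
```

With a helper of this shape, the two comparisons in nh_comp() and fib_find_info() would share a single definition of which bits are ignored.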
* [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down
2015-06-10 6:47 [PATCH net-next 0/3 v2] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
2015-06-10 6:47 ` [PATCH net-next 1/3 v2] net: track link-status of ipv4 nexthops Andy Gospodarek
@ 2015-06-10 6:47 ` Andy Gospodarek
2015-06-10 16:17 ` Alexander Duyck
2015-06-11 2:12 ` Scott Feldman
2015-06-10 6:47 ` [PATCH iproute2 3/3 v2] add support to print 'linkdown' nexthop flag Andy Gospodarek
2 siblings, 2 replies; 12+ messages in thread
From: Andy Gospodarek @ 2015-06-10 6:47 UTC (permalink / raw)
To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
Cc: Andy Gospodarek
This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.
net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...
When the above sysctls are set, the kernel will report to userspace that a
route is dead and will no longer resolve to this nexthop when performing a
fib lookup. This will signal to userspace that the route will not be
selected. The signalling of RTNH_F_DEAD is only passed to userspace
if the sysctl is enabled and the link is down. This was done as without it
the netlink listeners would have no idea whether or not a nexthop would be
selected. The kernel only sets RTNH_F_DEAD internally if the interface has
IFF_UP cleared.
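A rough userspace model of the reporting rule described above may help: RTNH_F_DEAD is only added to what userspace sees when the sysctl is enabled and the nexthop already carries RTNH_F_LINKDOWN. The function reported_nh_flags() and its boolean parameter are hypothetical stand-ins for the per-device sysctl lookup, and the RTNH_F_LINKDOWN value is assumed for illustration. In the patch itself the corresponding logic sits in fib_dump_info().

```c
#include <stdbool.h>

#define RTNH_F_DEAD     1
#define RTNH_F_LINKDOWN 16	/* assumed value, for illustration only */

/* Flags as they would be reported to userspace for one nexthop.
 * ignore_linkdown models the ignore_routes_with_linkdown sysctl of the
 * nexthop's device. */
unsigned char reported_nh_flags(unsigned int nh_flags, bool ignore_linkdown)
{
	unsigned int flags = nh_flags;

	/* Report the nexthop as dead only when the link is down and the
	 * sysctl says such nexthops will not be used. */
	if (ignore_linkdown && (nh_flags & RTNH_F_LINKDOWN))
		flags |= RTNH_F_DEAD;

	/* The netlink nexthop flags field is a single byte. */
	return flags & 0xFF;
}
```

Setting the flags first and then ORing in RTNH_F_DEAD keeps the internal nh_flags untouched, so the kernel's own notion of dead stays tied to IFF_UP.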
With the new sysctl set, the following behavior can be observed
(interface p8p1 is link-down):
# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
# ip route get 90.0.0.1
90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
    cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo src 80.0.0.1
    cache <local>
# ip route get 80.0.0.2
80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
    cache
While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down. Now interface p8p1 is linked-up:
# ip route show
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
# ip route get 90.0.0.1
90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
    cache
# ip route get 80.0.0.1
local 80.0.0.1 dev lo src 80.0.0.1
    cache <local>
# ip route get 80.0.0.2
80.0.0.2 dev p8p1 src 80.0.0.1
    cache
and the output changes to what one would expect.
If the sysctl is not set, the following output would be expected when p8p1 is down: # ip route show default via 10.0.5.2 dev p9p1 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 Since the dead flag does not appear, there should be no expectation that the kernel would skip using this route due to link being down. v2: Split kernel changes into 2 patches, this actually makes a behavioral change if the sysctl is set. Also took suggestion from Alex to simplify code by only checking sysctl during fib lookup and suggestion from Scott to add a per-interface sysctl. Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> --- include/linux/inetdevice.h | 3 +++ include/net/fib_rules.h | 3 ++- include/net/ip_fib.h | 17 ++++++++++------- include/uapi/linux/ip.h | 1 + include/uapi/linux/sysctl.h | 1 + kernel/sysctl_binary.c | 1 + net/ipv4/devinet.c | 2 ++ net/ipv4/fib_frontend.c | 6 +++--- net/ipv4/fib_rules.c | 5 +++-- net/ipv4/fib_semantics.c | 28 ++++++++++++++++++++++------ net/ipv4/fib_trie.c | 7 +++++++ net/ipv4/netfilter/ipt_rpfilter.c | 2 +- net/ipv4/route.c | 10 +++++----- 13 files changed, 61 insertions(+), 25 deletions(-) diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h index 0a21fbe..a4328ce 100644 --- a/include/linux/inetdevice.h +++ b/include/linux/inetdevice.h @@ -120,6 +120,9 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev) || (!IN_DEV_FORWARD(in_dev) && \ IN_DEV_ORCONF((in_dev), ACCEPT_REDIRECTS))) +#define IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) \ + IN_DEV_CONF_GET((in_dev), IGNORE_ROUTES_WITH_LINKDOWN) + #define IN_DEV_ARPFILTER(in_dev) IN_DEV_ORCONF((in_dev), ARPFILTER) #define IN_DEV_ARP_ACCEPT(in_dev) IN_DEV_ORCONF((in_dev), 
ARP_ACCEPT) #define IN_DEV_ARP_ANNOUNCE(in_dev) IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE) diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h index 6d67383..903a55e 100644 --- a/include/net/fib_rules.h +++ b/include/net/fib_rules.h @@ -36,7 +36,8 @@ struct fib_lookup_arg { void *result; struct fib_rule *rule; int flags; -#define FIB_LOOKUP_NOREF 1 +#define FIB_LOOKUP_NOREF 1 +#define FIB_LOOKUP_IGNORE_LINKSTATE 2 }; struct fib_rules_ops { diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index d1de1b7..854d790 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -226,7 +226,7 @@ static inline struct fib_table *fib_new_table(struct net *net, u32 id) } static inline int fib_lookup(struct net *net, const struct flowi4 *flp, - struct fib_result *res) + struct fib_result *res, unsigned int flags) { struct fib_table *tb; int err = -ENETUNREACH; @@ -234,7 +234,7 @@ static inline int fib_lookup(struct net *net, const struct flowi4 *flp, rcu_read_lock(); tb = fib_get_table(net, RT_TABLE_MAIN); - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF)) + if (tb && !fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF)) err = 0; rcu_read_unlock(); @@ -249,16 +249,17 @@ void __net_exit fib4_rules_exit(struct net *net); struct fib_table *fib_new_table(struct net *net, u32 id); struct fib_table *fib_get_table(struct net *net, u32 id); -int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res); +int __fib_lookup(struct net *net, struct flowi4 *flp, + struct fib_result *res, unsigned int flags); static inline int fib_lookup(struct net *net, struct flowi4 *flp, - struct fib_result *res) + struct fib_result *res, unsigned int flags) { struct fib_table *tb; int err; if (net->ipv4.fib_has_custom_rules) - return __fib_lookup(net, flp, res); + return __fib_lookup(net, flp, res, flags | FIB_LOOKUP_NOREF); rcu_read_lock(); @@ -266,11 +267,13 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp, for (err = 0; !err; err = 
-ENETUNREACH) { tb = rcu_dereference_rtnl(net->ipv4.fib_main); - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF)) + if (tb && !fib_table_lookup(tb, flp, res, + flags | FIB_LOOKUP_NOREF)) break; tb = rcu_dereference_rtnl(net->ipv4.fib_default); - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF)) + if (tb && !fib_table_lookup(tb, flp, res, + flags | FIB_LOOKUP_NOREF)) break; } diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h index 4119594..08f894d 100644 --- a/include/uapi/linux/ip.h +++ b/include/uapi/linux/ip.h @@ -164,6 +164,7 @@ enum IPV4_DEVCONF_ROUTE_LOCALNET, IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL, IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL, + IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN, __IPV4_DEVCONF_MAX }; diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 0956373..62fda94 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -482,6 +482,7 @@ enum NET_IPV4_CONF_PROMOTE_SECONDARIES=20, NET_IPV4_CONF_ARP_ACCEPT=21, NET_IPV4_CONF_ARP_NOTIFY=22, + NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN=23, }; /* /proc/sys/net/ipv4/netfilter */ diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c index 7e7746a..c9d0a0e 100644 --- a/kernel/sysctl_binary.c +++ b/kernel/sysctl_binary.c @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = { { CTL_INT, NET_IPV4_CONF_NOPOLICY, "disable_policy" }, { CTL_INT, NET_IPV4_CONF_FORCE_IGMP_VERSION, "force_igmp_version" }, { CTL_INT, NET_IPV4_CONF_PROMOTE_SECONDARIES, "promote_secondaries" }, + { CTL_INT, NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN, "ignore_routes_with_linkdown" }, {} }; diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index 419d23c..7498716 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -2169,6 +2169,8 @@ static struct devinet_sysctl_table { "igmpv2_unsolicited_report_interval"), DEVINET_SYSCTL_RW_ENTRY(IGMPV3_UNSOLICITED_REPORT_INTERVAL, 
"igmpv3_unsolicited_report_interval"), + DEVINET_SYSCTL_RW_ENTRY(IGNORE_ROUTES_WITH_LINKDOWN, + "ignore_routes_with_linkdown"), DEVINET_SYSCTL_FLUSHING_ENTRY(NOXFRM, "disable_xfrm"), DEVINET_SYSCTL_FLUSHING_ENTRY(NOPOLICY, "disable_policy"), diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 1e4c646..ead31c6 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -280,7 +280,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb) fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos); fl4.flowi4_scope = scope; fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0; - if (!fib_lookup(net, &fl4, &res)) + if (!fib_lookup(net, &fl4, &res, 0)) return FIB_RES_PREFSRC(net, res); } else { scope = RT_SCOPE_LINK; @@ -319,7 +319,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0; net = dev_net(dev); - if (fib_lookup(net, &fl4, &res)) + if (fib_lookup(net, &fl4, &res, 0)) goto last_resort; if (res.type != RTN_UNICAST && (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev))) @@ -354,7 +354,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, fl4.flowi4_oif = dev->ifindex; ret = 0; - if (fib_lookup(net, &fl4, &res) == 0) { + if (fib_lookup(net, &fl4, &res, 0) == 0) { if (res.type == RTN_UNICAST) ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST; } diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c index 5615198..18123d5 100644 --- a/net/ipv4/fib_rules.c +++ b/net/ipv4/fib_rules.c @@ -47,11 +47,12 @@ struct fib4_rule { #endif }; -int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res) +int __fib_lookup(struct net *net, struct flowi4 *flp, + struct fib_result *res, unsigned int flags) { struct fib_lookup_arg arg = { .result = res, - .flags = FIB_LOOKUP_NOREF, + .flags = flags, }; int err; diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 776e029..4dd709f 100644 --- a/net/ipv4/fib_semantics.c +++ 
b/net/ipv4/fib_semantics.c @@ -623,7 +623,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi, /* It is not necessary, but requires a bit of thinking */ if (fl4.flowi4_scope < RT_SCOPE_LINK) fl4.flowi4_scope = RT_SCOPE_LINK; - err = fib_lookup(net, &fl4, &res); + err = fib_lookup(net, &fl4, &res, + FIB_LOOKUP_IGNORE_LINKSTATE); if (err) { rcu_read_unlock(); return err; @@ -1035,12 +1036,16 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event, nla_put_in_addr(skb, RTA_PREFSRC, fi->fib_prefsrc)) goto nla_put_failure; if (fi->fib_nhs == 1) { + struct in_device *in_dev = __in_dev_get_rcu(fi->fib_nh->nh_dev); if (fi->fib_nh->nh_gw && nla_put_in_addr(skb, RTA_GATEWAY, fi->fib_nh->nh_gw)) goto nla_put_failure; if (fi->fib_nh->nh_oif && nla_put_u32(skb, RTA_OIF, fi->fib_nh->nh_oif)) goto nla_put_failure; + if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + fi->fib_nh->nh_flags & RTNH_F_LINKDOWN) + rtm->rtm_flags |= RTNH_F_DEAD; #ifdef CONFIG_IP_ROUTE_CLASSID if (fi->fib_nh[0].nh_tclassid && nla_put_u32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid)) @@ -1057,11 +1062,16 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event, goto nla_put_failure; for_nexthops(fi) { + struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev); rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh)); if (!rtnh) goto nla_put_failure; - rtnh->rtnh_flags = nh->nh_flags & 0xFF; + if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + nh->nh_flags & RTNH_F_LINKDOWN) + rtnh->rtnh_flags = (nh->nh_flags | RTNH_F_DEAD) & 0xFF; + else + rtnh->rtnh_flags = nh->nh_flags & 0xFF; rtnh->rtnh_hops = nh->nh_weight - 1; rtnh->rtnh_ifindex = nh->nh_oif; @@ -1306,16 +1316,22 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags) void fib_select_multipath(struct fib_result *res) { struct fib_info *fi = res->fi; + struct in_device *in_dev; int w; spin_lock_bh(&fib_multipath_lock); if (fi->fib_power <= 0) { int power = 0; change_nexthops(fi) { - 
if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) { - power += nexthop_nh->nh_weight; - nexthop_nh->nh_power = nexthop_nh->nh_weight; - } + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); + if (nexthop_nh->nh_flags & RTNH_F_DEAD) + continue; + if (in_dev && + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + nexthop_nh->nh_flags & RTNH_F_LINKDOWN) + continue; + power += nexthop_nh->nh_weight; + nexthop_nh->nh_power = nexthop_nh->nh_weight; } endfor_nexthops(fi); fi->fib_power = power; if (power <= 0) { diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 3c699c4..f75ca20 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1407,11 +1407,18 @@ found: } if (fi->fib_flags & RTNH_F_DEAD) continue; + for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) { const struct fib_nh *nh = &fi->fib_nh[nhsel]; + struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev); if (nh->nh_flags & RTNH_F_DEAD) continue; + if (in_dev && + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + nh->nh_flags & RTNH_F_LINKDOWN && + !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE)) + continue; if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif) continue; diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c index 4bfaedf..250c633 100644 --- a/net/ipv4/netfilter/ipt_rpfilter.c +++ b/net/ipv4/netfilter/ipt_rpfilter.c @@ -40,7 +40,7 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4, struct net *net = dev_net(dev); int ret __maybe_unused; - if (fib_lookup(net, fl4, &res)) + if (fib_lookup(net, fl4, &res, 0)) return false; if (res.type != RTN_UNICAST) { diff --git a/net/ipv4/route.c b/net/ipv4/route.c index f605598..d0362a2 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -747,7 +747,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow if (!(n->nud_state & NUD_VALID)) { neigh_event_send(n, NULL); } else { - if (fib_lookup(net, fl4, &res) == 0) { + if (fib_lookup(net, fl4, &res, 0) == 0) { struct fib_nh *nh = &FIB_RES_NH(res); 
update_or_create_fnhe(nh, fl4->daddr, new_gw, @@ -975,7 +975,7 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu) return; rcu_read_lock(); - if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) { + if (fib_lookup(dev_net(dst->dev), fl4, &res, 0) == 0) { struct fib_nh *nh = &FIB_RES_NH(res); update_or_create_fnhe(nh, fl4->daddr, 0, mtu, @@ -1186,7 +1186,7 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt) fl4.flowi4_mark = skb->mark; rcu_read_lock(); - if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res) == 0) + if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res, 0) == 0) src = FIB_RES_PREFSRC(dev_net(rt->dst.dev), res); else src = inet_select_addr(rt->dst.dev, @@ -1716,7 +1716,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, fl4.flowi4_scope = RT_SCOPE_UNIVERSE; fl4.daddr = daddr; fl4.saddr = saddr; - err = fib_lookup(net, &fl4, &res); + err = fib_lookup(net, &fl4, &res, 0); if (err != 0) { if (!IN_DEV_FORWARD(in_dev)) err = -EHOSTUNREACH; @@ -2123,7 +2123,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) goto make_route; } - if (fib_lookup(net, fl4, &res)) { + if (fib_lookup(net, fl4, &res, 0)) { res.fi = NULL; res.table = NULL; if (fl4->flowi4_oif) { -- 1.9.3 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down
2015-06-10 6:47 ` [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
@ 2015-06-10 16:17 ` Alexander Duyck
2015-06-10 19:04 ` Andy Gospodarek
2015-06-11 2:12 ` Scott Feldman
1 sibling, 1 reply; 12+ messages in thread
From: Alexander Duyck @ 2015-06-10 16:17 UTC (permalink / raw)
To: Andy Gospodarek, netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
On 06/09/2015 11:47 PM, Andy Gospodarek wrote:
> This feature is only enabled with the new per-interface or ipv4 global
> sysctls called 'ignore_routes_with_linkdown'.
>
> net.ipv4.conf.all.ignore_routes_with_linkdown = 0
> net.ipv4.conf.default.ignore_routes_with_linkdown = 0
> net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
> ...
>
> When the above sysctls are set, the kernel will report to userspace that a
> route is dead and will no longer resolve to this nexthop when performing a
> fib lookup. This will signal to userspace that the route will not be
> selected. The signalling of RTNH_F_DEAD is only passed to userspace
> if the sysctl is enabled and the link is down. This was done as without it
> the netlink listeners would have no idea whether or not a nexthop would be
> selected. The kernel only sets RTNH_F_DEAD internally if the interface has
> IFF_UP cleared.
>
> With the new sysctl set, the following behavior can be observed
> (interface p8p1 is link-down):
>
> # ip route show
> default via 10.0.5.2 dev p9p1
> 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
> 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
> 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
> # ip route get 90.0.0.1
> 90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
>     cache
> # ip route get 80.0.0.1
> local 80.0.0.1 dev lo src 80.0.0.1
>     cache <local>
> # ip route get 80.0.0.2
> 80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
>     cache
>
> While the route does remain in the table (so it can be modified if
> needed rather than being wiped away as it would be if IFF_UP was
> cleared), the proper next-hop is chosen automatically when the link is
> down. Now interface p8p1 is linked-up:
>
> # ip route show
> default via 10.0.5.2 dev p9p1
> 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
> 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
> 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
> 192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
> # ip route get 90.0.0.1
> 90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
>     cache
> # ip route get 80.0.0.1
> local 80.0.0.1 dev lo src 80.0.0.1
>     cache <local>
> # ip route get 80.0.0.2
> 80.0.0.2 dev p8p1 src 80.0.0.1
>     cache
>
> and the output changes to what one would expect.
> > If the sysctl is not set, the following output would be expected when > p8p1 is down: > > # ip route show > default via 10.0.5.2 dev p9p1 > 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 > 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 > 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown > 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown > 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 > > Since the dead flag does not appear, there should be no expectation that > the kernel would skip using this route due to link being down. > > v2: Split kernel changes into 2 patches, this actually makes a > behavioral change if the sysctl is set. Also took suggestion from Alex > to simplify code by only checking sysctl during fib lookup and > suggestion from Scott to add a per-interface sysctl. > > Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> > Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> > --- > include/linux/inetdevice.h | 3 +++ > include/net/fib_rules.h | 3 ++- > include/net/ip_fib.h | 17 ++++++++++------- > include/uapi/linux/ip.h | 1 + > include/uapi/linux/sysctl.h | 1 + > kernel/sysctl_binary.c | 1 + > net/ipv4/devinet.c | 2 ++ > net/ipv4/fib_frontend.c | 6 +++--- > net/ipv4/fib_rules.c | 5 +++-- > net/ipv4/fib_semantics.c | 28 ++++++++++++++++++++++------ > net/ipv4/fib_trie.c | 7 +++++++ > net/ipv4/netfilter/ipt_rpfilter.c | 2 +- > net/ipv4/route.c | 10 +++++----- > 13 files changed, 61 insertions(+), 25 deletions(-) > > diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h > index 0a21fbe..a4328ce 100644 > --- a/include/linux/inetdevice.h > +++ b/include/linux/inetdevice.h > @@ -120,6 +120,9 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev) > || (!IN_DEV_FORWARD(in_dev) && \ > IN_DEV_ORCONF((in_dev), ACCEPT_REDIRECTS))) > > +#define IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) \ > + IN_DEV_CONF_GET((in_dev), IGNORE_ROUTES_WITH_LINKDOWN) > + > #define IN_DEV_ARPFILTER(in_dev) 
IN_DEV_ORCONF((in_dev), ARPFILTER) > #define IN_DEV_ARP_ACCEPT(in_dev) IN_DEV_ORCONF((in_dev), ARP_ACCEPT) > #define IN_DEV_ARP_ANNOUNCE(in_dev) IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE) > diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h > index 6d67383..903a55e 100644 > --- a/include/net/fib_rules.h > +++ b/include/net/fib_rules.h > @@ -36,7 +36,8 @@ struct fib_lookup_arg { > void *result; > struct fib_rule *rule; > int flags; > -#define FIB_LOOKUP_NOREF 1 > +#define FIB_LOOKUP_NOREF 1 > +#define FIB_LOOKUP_IGNORE_LINKSTATE 2 > }; > > struct fib_rules_ops { > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h > index d1de1b7..854d790 100644 > --- a/include/net/ip_fib.h > +++ b/include/net/ip_fib.h > @@ -226,7 +226,7 @@ static inline struct fib_table *fib_new_table(struct net *net, u32 id) > } > > static inline int fib_lookup(struct net *net, const struct flowi4 *flp, > - struct fib_result *res) > + struct fib_result *res, unsigned int flags) > { > struct fib_table *tb; > int err = -ENETUNREACH; > @@ -234,7 +234,7 @@ static inline int fib_lookup(struct net *net, const struct flowi4 *flp, > rcu_read_lock(); > > tb = fib_get_table(net, RT_TABLE_MAIN); > - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF)) > + if (tb && !fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF)) > err = 0; > > rcu_read_unlock(); > @@ -249,16 +249,17 @@ void __net_exit fib4_rules_exit(struct net *net); > struct fib_table *fib_new_table(struct net *net, u32 id); > struct fib_table *fib_get_table(struct net *net, u32 id); > > -int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res); > +int __fib_lookup(struct net *net, struct flowi4 *flp, > + struct fib_result *res, unsigned int flags); > > static inline int fib_lookup(struct net *net, struct flowi4 *flp, > - struct fib_result *res) > + struct fib_result *res, unsigned int flags) > { > struct fib_table *tb; > int err; > > if (net->ipv4.fib_has_custom_rules) > - return __fib_lookup(net, flp, 
res); > + return __fib_lookup(net, flp, res, flags | FIB_LOOKUP_NOREF); > > rcu_read_lock(); > > @@ -266,11 +267,13 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp, > > for (err = 0; !err; err = -ENETUNREACH) { > tb = rcu_dereference_rtnl(net->ipv4.fib_main); > - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF)) > + if (tb && !fib_table_lookup(tb, flp, res, > + flags | FIB_LOOKUP_NOREF)) > break; > > tb = rcu_dereference_rtnl(net->ipv4.fib_default); > - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF)) > + if (tb && !fib_table_lookup(tb, flp, res, > + flags | FIB_LOOKUP_NOREF)) > break; > } > Instead of 3 lines w/ flags | FIB_LOOKUP_NOREF you could probably just do a flags |= FIB_LOOKUP_NOREF once and save yourself some trouble. > diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h > index 4119594..08f894d 100644 > --- a/include/uapi/linux/ip.h > +++ b/include/uapi/linux/ip.h > @@ -164,6 +164,7 @@ enum > IPV4_DEVCONF_ROUTE_LOCALNET, > IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL, > IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL, > + IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN, > __IPV4_DEVCONF_MAX > }; > > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h > index 0956373..62fda94 100644 > --- a/include/uapi/linux/sysctl.h > +++ b/include/uapi/linux/sysctl.h > @@ -482,6 +482,7 @@ enum > NET_IPV4_CONF_PROMOTE_SECONDARIES=20, > NET_IPV4_CONF_ARP_ACCEPT=21, > NET_IPV4_CONF_ARP_NOTIFY=22, > + NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN=23, > }; > > /* /proc/sys/net/ipv4/netfilter */ > diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c > index 7e7746a..c9d0a0e 100644 > --- a/kernel/sysctl_binary.c > +++ b/kernel/sysctl_binary.c > @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = { > { CTL_INT, NET_IPV4_CONF_NOPOLICY, "disable_policy" }, > { CTL_INT, NET_IPV4_CONF_FORCE_IGMP_VERSION, "force_igmp_version" }, > { CTL_INT, NET_IPV4_CONF_PROMOTE_SECONDARIES, 
"promote_secondaries" }, > + { CTL_INT, NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN, "ignore_routes_with_linkdown" }, > {} > }; > > diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c > index 419d23c..7498716 100644 > --- a/net/ipv4/devinet.c > +++ b/net/ipv4/devinet.c > @@ -2169,6 +2169,8 @@ static struct devinet_sysctl_table { > "igmpv2_unsolicited_report_interval"), > DEVINET_SYSCTL_RW_ENTRY(IGMPV3_UNSOLICITED_REPORT_INTERVAL, > "igmpv3_unsolicited_report_interval"), > + DEVINET_SYSCTL_RW_ENTRY(IGNORE_ROUTES_WITH_LINKDOWN, > + "ignore_routes_with_linkdown"), > > DEVINET_SYSCTL_FLUSHING_ENTRY(NOXFRM, "disable_xfrm"), > DEVINET_SYSCTL_FLUSHING_ENTRY(NOPOLICY, "disable_policy"), > diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c > index 1e4c646..ead31c6 100644 > --- a/net/ipv4/fib_frontend.c > +++ b/net/ipv4/fib_frontend.c > @@ -280,7 +280,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb) > fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos); > fl4.flowi4_scope = scope; > fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0; > - if (!fib_lookup(net, &fl4, &res)) > + if (!fib_lookup(net, &fl4, &res, 0)) > return FIB_RES_PREFSRC(net, res); > } else { > scope = RT_SCOPE_LINK; > @@ -319,7 +319,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, > fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0; > > net = dev_net(dev); > - if (fib_lookup(net, &fl4, &res)) > + if (fib_lookup(net, &fl4, &res, 0)) > goto last_resort; > if (res.type != RTN_UNICAST && > (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev))) > @@ -354,7 +354,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, > fl4.flowi4_oif = dev->ifindex; > > ret = 0; > - if (fib_lookup(net, &fl4, &res) == 0) { > + if (fib_lookup(net, &fl4, &res, 0) == 0) { > if (res.type == RTN_UNICAST) > ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST; > } The code for validating a source could probably ignore the LINKDOWN message. 
Otherwise we run the risk of a link flapping and confusing the source since the link is down but any Rx packets in the rings are being flushed. > diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c > index 5615198..18123d5 100644 > --- a/net/ipv4/fib_rules.c > +++ b/net/ipv4/fib_rules.c > @@ -47,11 +47,12 @@ struct fib4_rule { > #endif > }; > > -int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res) > +int __fib_lookup(struct net *net, struct flowi4 *flp, > + struct fib_result *res, unsigned int flags) > { > struct fib_lookup_arg arg = { > .result = res, > - .flags = FIB_LOOKUP_NOREF, > + .flags = flags, > }; > int err; > > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > index 776e029..4dd709f 100644 > --- a/net/ipv4/fib_semantics.c > +++ b/net/ipv4/fib_semantics.c > @@ -623,7 +623,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi, > /* It is not necessary, but requires a bit of thinking */ > if (fl4.flowi4_scope < RT_SCOPE_LINK) > fl4.flowi4_scope = RT_SCOPE_LINK; > - err = fib_lookup(net, &fl4, &res); > + err = fib_lookup(net, &fl4, &res, > + FIB_LOOKUP_IGNORE_LINKSTATE); > if (err) { > rcu_read_unlock(); > return err; > @@ -1035,12 +1036,16 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event, > nla_put_in_addr(skb, RTA_PREFSRC, fi->fib_prefsrc)) > goto nla_put_failure; > if (fi->fib_nhs == 1) { > + struct in_device *in_dev = __in_dev_get_rcu(fi->fib_nh->nh_dev); > if (fi->fib_nh->nh_gw && > nla_put_in_addr(skb, RTA_GATEWAY, fi->fib_nh->nh_gw)) > goto nla_put_failure; > if (fi->fib_nh->nh_oif && > nla_put_u32(skb, RTA_OIF, fi->fib_nh->nh_oif)) > goto nla_put_failure; > + if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > + fi->fib_nh->nh_flags & RTNH_F_LINKDOWN) > + rtm->rtm_flags |= RTNH_F_DEAD; > #ifdef CONFIG_IP_ROUTE_CLASSID > if (fi->fib_nh[0].nh_tclassid && > nla_put_u32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid)) > @@ -1057,11 +1062,16 @@ int 
fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
>  			goto nla_put_failure;
>
>  		for_nexthops(fi) {
> +			struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
>  			rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
>  			if (!rtnh)
>  				goto nla_put_failure;
>
> -			rtnh->rtnh_flags = nh->nh_flags & 0xFF;
> +			if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> +			    nh->nh_flags & RTNH_F_LINKDOWN)
> +				rtnh->rtnh_flags = (nh->nh_flags | RTNH_F_DEAD) & 0xFF;
> +			else
> +				rtnh->rtnh_flags = nh->nh_flags & 0xFF;
>  			rtnh->rtnh_hops = nh->nh_weight - 1;
>  			rtnh->rtnh_ifindex = nh->nh_oif;
>
Why not just split this if into two separate statements? One taking care
of the first setting of rtnh_flags and then a second one ORing in the
RTNH_F_DEAD.
> @@ -1306,16 +1316,22 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>  void fib_select_multipath(struct fib_result *res)
>  {
>  	struct fib_info *fi = res->fi;
> +	struct in_device *in_dev;
>  	int w;
>
>  	spin_lock_bh(&fib_multipath_lock);
>  	if (fi->fib_power <= 0) {
>  		int power = 0;
>  		change_nexthops(fi) {
> -			if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
> -				power += nexthop_nh->nh_weight;
> -				nexthop_nh->nh_power = nexthop_nh->nh_weight;
> -			}
> +			in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> +			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> +				continue;
> +			if (in_dev &&
> +			    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> +			    nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
> +				continue;
> +			power += nexthop_nh->nh_weight;
> +			nexthop_nh->nh_power = nexthop_nh->nh_weight;
>  		} endfor_nexthops(fi);
>  		fi->fib_power = power;
>  		if (power <= 0) {
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 3c699c4..f75ca20 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1407,11 +1407,18 @@ found:
>  			}
>  			if (fi->fib_flags & RTNH_F_DEAD)
>  				continue;
> +
>  			for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
>  				const struct fib_nh *nh = &fi->fib_nh[nhsel];
> +				struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
>
>  			if (nh->nh_flags & RTNH_F_DEAD)
>  				continue;
> +			if (in_dev &&
> +			    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> +			    nh->nh_flags & RTNH_F_LINKDOWN &&
> +			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
> +				continue;
>  			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
>  				continue;
>

The order of checks should be:
1. (nh->nh_flags & RTNH_F_LINKDOWN)
2. !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE)
3. in_dev
4. IGNORE_ROUTES_WITH_LINKDOWN

That way we don't waste time checking the in_dev if the link isn't reported
as being down. Also I would probably move the whole block inside an if
statement based off of the first 2 checks since nothing else is making use
of in_dev.

> diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
> index 4bfaedf..250c633 100644
> --- a/net/ipv4/netfilter/ipt_rpfilter.c
> +++ b/net/ipv4/netfilter/ipt_rpfilter.c
> @@ -40,7 +40,7 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4,
>  	struct net *net = dev_net(dev);
>  	int ret __maybe_unused;
>
> -	if (fib_lookup(net, fl4, &res))
> +	if (fib_lookup(net, fl4, &res, 0))
>  		return false;
>
>  	if (res.type != RTN_UNICAST) {

Any rpfilter stuff can probably ignore the linkdown check since it is
possible that a driver could be flushing data just after a link went down.
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f605598..d0362a2 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -747,7 +747,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
>  		if (!(n->nud_state & NUD_VALID)) {
>  			neigh_event_send(n, NULL);
>  		} else {
> -			if (fib_lookup(net, fl4, &res) == 0) {
> +			if (fib_lookup(net, fl4, &res, 0) == 0) {
>  				struct fib_nh *nh = &FIB_RES_NH(res);
>
>  				update_or_create_fnhe(nh, fl4->daddr, new_gw,
> @@ -975,7 +975,7 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
>  		return;
>
>  	rcu_read_lock();
> -	if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
> +	if (fib_lookup(dev_net(dst->dev), fl4, &res, 0) == 0) {
>  		struct fib_nh *nh = &FIB_RES_NH(res);
>
>  		update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
> @@ -1186,7 +1186,7 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt)
>  		fl4.flowi4_mark = skb->mark;
>
>  		rcu_read_lock();
> -		if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res) == 0)
> +		if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res, 0) == 0)
>  			src = FIB_RES_PREFSRC(dev_net(rt->dst.dev), res);
>  		else
>  			src = inet_select_addr(rt->dst.dev,
> @@ -1716,7 +1716,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
>  	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
>  	fl4.daddr = daddr;
>  	fl4.saddr = saddr;
> -	err = fib_lookup(net, &fl4, &res);
> +	err = fib_lookup(net, &fl4, &res, 0);
>  	if (err != 0) {
>  		if (!IN_DEV_FORWARD(in_dev))
>  			err = -EHOSTUNREACH;
> @@ -2123,7 +2123,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
>  		goto make_route;
>  	}
>
> -	if (fib_lookup(net, fl4, &res)) {
> +	if (fib_lookup(net, fl4, &res, 0)) {
>  		res.fi = NULL;
>  		res.table = NULL;
>  		if (fl4->flowi4_oif) {
>
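
[Editorial sketch, not part of the thread] The suggestion above to split the
rtnh_flags assignment into two statements can be illustrated in plain
userspace C. This is only a minimal sketch under stated assumptions: the
SKETCH_* constants are stand-in values, sketch_dump_flags() is a hypothetical
helper, and the boolean parameter models
"in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev)". It is not the code
that was submitted for v3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in flag values for illustration only (assumption); the kernel
 * defines the real RTNH_F_* values in its uapi headers. */
#define SKETCH_RTNH_F_DEAD     0x01u
#define SKETCH_RTNH_F_LINKDOWN 0x10u

/* Hypothetical helper: compute the flags byte reported to userspace.
 * ignore_linkdown models the per-device sysctl check on a valid in_dev. */
static uint8_t sketch_dump_flags(unsigned int nh_flags, bool ignore_linkdown)
{
	/* Statement 1: always copy the low byte of the nexthop flags. */
	uint8_t flags = nh_flags & 0xFF;

	/* Statement 2: OR in DEAD only when the link is down and the
	 * sysctl asks for link-down nexthops to be ignored. */
	if ((nh_flags & SKETCH_RTNH_F_LINKDOWN) && ignore_linkdown)
		flags |= SKETCH_RTNH_F_DEAD;

	return flags;
}
```

Written this way the common assignment appears once, and the link-down case
becomes a small, separately readable adjustment.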
* Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down
2015-06-10 16:17 ` Alexander Duyck
@ 2015-06-10 19:04 ` Andy Gospodarek
0 siblings, 0 replies; 12+ messages in thread
From: Andy Gospodarek @ 2015-06-10 19:04 UTC (permalink / raw)
To: Alexander Duyck
Cc: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen

On Wed, Jun 10, 2015 at 09:17:19AM -0700, Alexander Duyck wrote:
>
>
> On 06/09/2015 11:47 PM, Andy Gospodarek wrote:
> >This feature is only enabled with the new per-interface or ipv4 global
> >sysctls called 'ignore_routes_with_linkdown'.
> >
> >net.ipv4.conf.all.ignore_routes_with_linkdown = 0
> >net.ipv4.conf.default.ignore_routes_with_linkdown = 0
> >net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
> >...
> >
> >When the above sysctls are set, the kernel will report to userspace that
> >a route is dead and will no longer resolve to this nexthop when performing
> >a fib lookup. This will signal to userspace that the route will not be
> >selected. The signalling of a RTNH_F_DEAD is only passed to userspace
> >if the sysctl is enabled and link is down. This was done as without it the
> >netlink listeners would have no idea whether or not a nexthop would be
> >selected. The kernel only sets RTNH_F_DEAD internally if the interface has
> >IFF_UP cleared.
> >
> >With the new sysctl set, the following behavior can be observed
> >(interface p8p1 is link-down):
> >
> ># ip route show
> >default via 10.0.5.2 dev p9p1
> >10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> >70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> >80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
> >90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
> >90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
> ># ip route get 90.0.0.1
> >90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
> >    cache
> ># ip route get 80.0.0.1
> >local 80.0.0.1 dev lo src 80.0.0.1
> >    cache <local>
> ># ip route get 80.0.0.2
> >80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
> >    cache
> >
> >While the route does remain in the table (so it can be modified if
> >needed rather than being wiped away as it would be if IFF_UP was
> >cleared), the proper next-hop is chosen automatically when the link is
> >down. Now interface p8p1 is linked-up:
> >
> ># ip route show
> >default via 10.0.5.2 dev p9p1
> >10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> >70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> >80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
> >90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
> >90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
> >192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
> ># ip route get 90.0.0.1
> >90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
> >    cache
> ># ip route get 80.0.0.1
> >local 80.0.0.1 dev lo src 80.0.0.1
> >    cache <local>
> ># ip route get 80.0.0.2
> >80.0.0.2 dev p8p1 src 80.0.0.1
> >    cache
> >
> >and the output changes to what one would expect.
> >
> >If the sysctl is not set, the following output would be expected when
> >p8p1 is down:
> >
> ># ip route show
> >default via 10.0.5.2 dev p9p1
> >10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> >70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> >80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown
> >90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown
> >90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
> >
> >Since the dead flag does not appear, there should be no expectation that
> >the kernel would skip using this route due to link being down.
> >
> >v2: Split kernel changes into 2 patches, this actually makes a
> >behavioral change if the sysctl is set. Also took suggestion from Alex
> >to simplify code by only checking sysctl during fib lookup and
> >suggestion from Scott to add a per-interface sysctl.
> >
> >Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
> >Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
> >---
> >  include/linux/inetdevice.h        |  3 +++
> >  include/net/fib_rules.h           |  3 ++-
> >  include/net/ip_fib.h              | 17 ++++++++++-------
> >  include/uapi/linux/ip.h           |  1 +
> >  include/uapi/linux/sysctl.h       |  1 +
> >  kernel/sysctl_binary.c            |  1 +
> >  net/ipv4/devinet.c                |  2 ++
> >  net/ipv4/fib_frontend.c           |  6 +++---
> >  net/ipv4/fib_rules.c              |  5 +++--
> >  net/ipv4/fib_semantics.c          | 28 ++++++++++++++++++++++------
> >  net/ipv4/fib_trie.c               |  7 +++++++
> >  net/ipv4/netfilter/ipt_rpfilter.c |  2 +-
> >  net/ipv4/route.c                  | 10 +++++-----
> >  13 files changed, 61 insertions(+), 25 deletions(-)

[...]
> >diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> >index d1de1b7..854d790 100644
> >--- a/include/net/ip_fib.h
> >+++ b/include/net/ip_fib.h
> >@@ -266,11 +267,13 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
> >
> >  	for (err = 0; !err; err = -ENETUNREACH) {
> >  		tb = rcu_dereference_rtnl(net->ipv4.fib_main);
> >-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
> >+		if (tb && !fib_table_lookup(tb, flp, res,
> >+					    flags | FIB_LOOKUP_NOREF))
> >  			break;
> >
> >  		tb = rcu_dereference_rtnl(net->ipv4.fib_default);
> >-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
> >+		if (tb && !fib_table_lookup(tb, flp, res,
> >+					    flags | FIB_LOOKUP_NOREF))
> >  			break;
> >  	}
> >
>
> Instead of 3 lines w/ flags | FIB_LOOKUP_NOREF you could probably just do a
> flags |= FIB_LOOKUP_NOREF once and save yourself some trouble.

Sure. But I get credit for less lines that way. ;-)

[...]
> >@@ -319,7 +319,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
> >  	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
> >
> >  	net = dev_net(dev);
> >-	if (fib_lookup(net, &fl4, &res))
> >+	if (fib_lookup(net, &fl4, &res, 0))
> >  		goto last_resort;
> >  	if (res.type != RTN_UNICAST &&
> >  	    (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
> >@@ -354,7 +354,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
> >  	fl4.flowi4_oif = dev->ifindex;
> >
> >  	ret = 0;
> >-	if (fib_lookup(net, &fl4, &res) == 0) {
> >+	if (fib_lookup(net, &fl4, &res, 0) == 0) {
> >  		if (res.type == RTN_UNICAST)
> >  			ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
> >  	}
>
> The code for validating a source could probably ignore the LINKDOWN message.
> Otherwise we run the risk of a link flapping and confusing the source since
> the link is down but any Rx packets in the rings are being flushed.

Excellent point.
After thinking about this a bit, I think you are correct that we would want
to consider a dead link or an alive link as a valid interface for receiving
traffic. Flag added for v3.

[...]
> >@@ -1057,11 +1062,16 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
> >  			goto nla_put_failure;
> >
> >  		for_nexthops(fi) {
> >+			struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
> >  			rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
> >  			if (!rtnh)
> >  				goto nla_put_failure;
> >
> >-			rtnh->rtnh_flags = nh->nh_flags & 0xFF;
> >+			if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> >+			    nh->nh_flags & RTNH_F_LINKDOWN)
> >+				rtnh->rtnh_flags = (nh->nh_flags | RTNH_F_DEAD) & 0xFF;
> >+			else
> >+				rtnh->rtnh_flags = nh->nh_flags & 0xFF;
> >  			rtnh->rtnh_hops = nh->nh_weight - 1;
> >  			rtnh->rtnh_ifindex = nh->nh_oif;
> >
>
> Why not just split this if into two separate statements? One taking care of
> the first setting of rtnh_flags and then a second one ORing in the
> RTNH_F_DEAD.

If that seems easier to maintain, I can do that for v3.

[...]
> >diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> >index 3c699c4..f75ca20 100644
> >--- a/net/ipv4/fib_trie.c
> >+++ b/net/ipv4/fib_trie.c
> >@@ -1407,11 +1407,18 @@ found:
> >  		}
> >  		if (fi->fib_flags & RTNH_F_DEAD)
> >  			continue;
> >+
> >  		for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
> >  			const struct fib_nh *nh = &fi->fib_nh[nhsel];
> >+			struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
> >
> >  			if (nh->nh_flags & RTNH_F_DEAD)
> >  				continue;
> >+			if (in_dev &&
> >+			    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> >+			    nh->nh_flags & RTNH_F_LINKDOWN &&
> >+			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
> >+				continue;
> >  			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
> >  				continue;
> >
>
> The order of checks should be:
> 1. (nh->nh_flags & RTNH_F_LINKDOWN)
> 2. !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE)

This one is not needed as we will not have this flag set anywhere, but 1, 3,
and 4 in that order seems cleaner.

> 3. in_dev
> 4. IGNORE_ROUTES_WITH_LINKDOWN
>
> That way we don't waste time checking the in_dev if the link isn't reported
> as being down. Also I would probably move the whole block inside an if
> statement based off of the first 2 checks since nothing else is making use
> of in_dev.

This seems like a nice optimization. I'll do it here and above outside the
nh loop.

>
> >diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
> >index 4bfaedf..250c633 100644
> >--- a/net/ipv4/netfilter/ipt_rpfilter.c
> >+++ b/net/ipv4/netfilter/ipt_rpfilter.c
> >@@ -40,7 +40,7 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4,
> >  	struct net *net = dev_net(dev);
> >  	int ret __maybe_unused;
> >
> >-	if (fib_lookup(net, fl4, &res))
> >+	if (fib_lookup(net, fl4, &res, 0))
> >  		return false;
> >
> >  	if (res.type != RTN_UNICAST) {
>
> Any rpfilter stuff can probably ignore the linkdown check since it is
> possible that a driver could be flushing data just after a link went down.

Agreed based on thoughts from __fib_validate_source.

Thanks for this review, too.
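
[Editorial sketch, not part of the thread] The check ordering discussed above
(test the cheap flag bits first, and only then fetch in_dev and read the
sysctl) can be shown with a small userspace model. Everything here is an
assumption made for illustration: the sketch_* types and functions are
hypothetical, the SKETCH_* values are stand-ins, and a lookup counter replaces
__in_dev_get_rcu() so the saving is observable. It keeps all four checks in
the order Alexander listed; the reply above suggests check 2 may not be
needed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in values for illustration only (assumption). */
#define SKETCH_RTNH_F_LINKDOWN             0x10u
#define SKETCH_FIB_LOOKUP_IGNORE_LINKSTATE 0x02u

/* Hypothetical stand-ins for struct in_device and struct fib_nh. */
struct sketch_in_device {
	bool ignore_routes_with_linkdown;	/* models the per-device sysctl */
};

struct sketch_nh {
	unsigned int nh_flags;
	struct sketch_in_device *in_dev;
};

/* Counts how often the (comparatively costly) in_dev lookup runs. */
static int sketch_in_dev_lookups;

static struct sketch_in_device *sketch_in_dev_get(const struct sketch_nh *nh)
{
	sketch_in_dev_lookups++;
	return nh->in_dev;
}

/* Returns true when the nexthop should be skipped during lookup. */
static bool sketch_skip_nexthop(const struct sketch_nh *nh,
				unsigned int fib_flags)
{
	/* Checks 1 and 2: plain flag tests, no in_dev needed yet. */
	if ((nh->nh_flags & SKETCH_RTNH_F_LINKDOWN) &&
	    !(fib_flags & SKETCH_FIB_LOOKUP_IGNORE_LINKSTATE)) {
		/* Checks 3 and 4: only now fetch in_dev and read the sysctl. */
		struct sketch_in_device *in_dev = sketch_in_dev_get(nh);

		if (in_dev && in_dev->ignore_routes_with_linkdown)
			return true;
	}
	return false;
}
```

In the common case of a link that is up, the function returns after a single
bit test and never touches in_dev, which is the point of the reordering.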
* Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down
2015-06-10 6:47 ` [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
2015-06-10 16:17 ` Alexander Duyck
@ 2015-06-11 2:12 ` Scott Feldman
2015-06-11 2:18 ` Scott Feldman
1 sibling, 1 reply; 12+ messages in thread
From: Scott Feldman @ 2015-06-11 2:12 UTC (permalink / raw)
To: Andy Gospodarek
Cc: Netdev, David S. Miller, ddutt, Alexander Duyck, Hannes Frederic Sowa, stephen@networkplumber.org

On Tue, Jun 9, 2015 at 11:47 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:

>  /* /proc/sys/net/ipv4/netfilter */
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 7e7746a..c9d0a0e 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
>  	{ CTL_INT, NET_IPV4_CONF_NOPOLICY, "disable_policy" },
>  	{ CTL_INT, NET_IPV4_CONF_FORCE_IGMP_VERSION, "force_igmp_version" },
>  	{ CTL_INT, NET_IPV4_CONF_PROMOTE_SECONDARIES, "promote_secondaries" },
> +	{ CTL_INT, NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN, "ignore_routes_with_linkdown" },

Would "route_ignore_linkdown_nexthops" be a more accurate name? The
patch marks link-downed nexthops to be ignored, not the route,
correct?

s/NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN/NET_IPV4_CONF_ROUTE_IGNORE_LINKDOWN_NEXTHOPS
* Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down
2015-06-11 2:12 ` Scott Feldman
@ 2015-06-11 2:18 ` Scott Feldman
2015-06-11 2:23 ` Andy Gospodarek
0 siblings, 1 reply; 12+ messages in thread
From: Scott Feldman @ 2015-06-11 2:18 UTC (permalink / raw)
To: Andy Gospodarek
Cc: Netdev, David S. Miller, ddutt, Alexander Duyck, Hannes Frederic Sowa, stephen@networkplumber.org

On Wed, Jun 10, 2015 at 7:12 PM, Scott Feldman <sfeldma@gmail.com> wrote:
> On Tue, Jun 9, 2015 at 11:47 PM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
>
>>  /* /proc/sys/net/ipv4/netfilter */
>> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
>> index 7e7746a..c9d0a0e 100644
>> --- a/kernel/sysctl_binary.c
>> +++ b/kernel/sysctl_binary.c
>> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
>>  	{ CTL_INT, NET_IPV4_CONF_NOPOLICY, "disable_policy" },
>>  	{ CTL_INT, NET_IPV4_CONF_FORCE_IGMP_VERSION, "force_igmp_version" },
>>  	{ CTL_INT, NET_IPV4_CONF_PROMOTE_SECONDARIES, "promote_secondaries" },
>> +	{ CTL_INT, NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN, "ignore_routes_with_linkdown" },
>
> Would "route_ignore_linkdown_nexthops" be a more accurate name? The
> patch marks link-downed nexthops to be ignored, not the route,
> correct?
>
> s/NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN/NET_IPV4_CONF_ROUTE_IGNORE_LINKDOWN_NEXTHOPS

Something like that. Not sure I like my suggestion. If dev is
nexthop dev in route, and dev is link down, exclude nexthop in route
lookup.

route_exclude_if_linkdown_nexthop_dev?
* Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down
2015-06-11 2:18 ` Scott Feldman
@ 2015-06-11 2:23 ` Andy Gospodarek
2015-06-11 2:57 ` Scott Feldman
0 siblings, 1 reply; 12+ messages in thread
From: Andy Gospodarek @ 2015-06-11 2:23 UTC (permalink / raw)
To: Scott Feldman
Cc: Netdev, David S. Miller, ddutt, Alexander Duyck, Hannes Frederic Sowa, stephen@networkplumber.org

On Wed, Jun 10, 2015 at 07:18:59PM -0700, Scott Feldman wrote:
> On Wed, Jun 10, 2015 at 7:12 PM, Scott Feldman <sfeldma@gmail.com> wrote:
> > On Tue, Jun 9, 2015 at 11:47 PM, Andy Gospodarek
> > <gospo@cumulusnetworks.com> wrote:
> >
> >>  /* /proc/sys/net/ipv4/netfilter */
> >> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> >> index 7e7746a..c9d0a0e 100644
> >> --- a/kernel/sysctl_binary.c
> >> +++ b/kernel/sysctl_binary.c
> >> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
> >>  	{ CTL_INT, NET_IPV4_CONF_NOPOLICY, "disable_policy" },
> >>  	{ CTL_INT, NET_IPV4_CONF_FORCE_IGMP_VERSION, "force_igmp_version" },
> >>  	{ CTL_INT, NET_IPV4_CONF_PROMOTE_SECONDARIES, "promote_secondaries" },
> >> +	{ CTL_INT, NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN, "ignore_routes_with_linkdown" },
> >
> > Would "route_ignore_linkdown_nexthops" be a more accurate name? The
> > patch marks link-downed nexthops to be ignored, not the route,
> > correct?
> >
> > s/NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN/NET_IPV4_CONF_ROUTE_IGNORE_LINKDOWN_NEXTHOPS
>
> Something like that. Not sure I like my suggestion. If dev is
> nexthop dev in route, and dev is link down, exclude nexthop in route
> lookup.

I actually played around with a bunch of different names and this was as
short as I could get it while still conveying the point. I'm getting
ready to submit v3, so speak now if you really do hate the name proposed
in v2.

> route_exclude_if_linkdown_nexthop_dev?

Not bad, but you see what I mean about it being a mouthful. :)
* Re: [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down
2015-06-11 2:23 ` Andy Gospodarek
@ 2015-06-11 2:57 ` Scott Feldman
0 siblings, 0 replies; 12+ messages in thread
From: Scott Feldman @ 2015-06-11 2:57 UTC (permalink / raw)
To: Andy Gospodarek
Cc: Netdev, David S. Miller, ddutt, Alexander Duyck, Hannes Frederic Sowa, stephen@networkplumber.org

On Wed, Jun 10, 2015 at 7:23 PM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Wed, Jun 10, 2015 at 07:18:59PM -0700, Scott Feldman wrote:
>> On Wed, Jun 10, 2015 at 7:12 PM, Scott Feldman <sfeldma@gmail.com> wrote:
>> > On Tue, Jun 9, 2015 at 11:47 PM, Andy Gospodarek
>> > <gospo@cumulusnetworks.com> wrote:
>> >
>> >>  /* /proc/sys/net/ipv4/netfilter */
>> >> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
>> >> index 7e7746a..c9d0a0e 100644
>> >> --- a/kernel/sysctl_binary.c
>> >> +++ b/kernel/sysctl_binary.c
>> >> @@ -253,6 +253,7 @@ static const struct bin_table bin_net_ipv4_conf_vars_table[] = {
>> >>  	{ CTL_INT, NET_IPV4_CONF_NOPOLICY, "disable_policy" },
>> >>  	{ CTL_INT, NET_IPV4_CONF_FORCE_IGMP_VERSION, "force_igmp_version" },
>> >>  	{ CTL_INT, NET_IPV4_CONF_PROMOTE_SECONDARIES, "promote_secondaries" },
>> >> +	{ CTL_INT, NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN, "ignore_routes_with_linkdown" },
>> >
>> > Would "route_ignore_linkdown_nexthops" be a more accurate name? The
>> > patch marks link-downed nexthops to be ignored, not the route,
>> > correct?
>> >
>> > s/NET_IPV4_CONF_IGNORE_ROUTES_WITH_LINKDOWN/NET_IPV4_CONF_ROUTE_IGNORE_LINKDOWN_NEXTHOPS
>>
>> Something like that. Not sure I like my suggestion. If dev is
>> nexthop dev in route, and dev is link down, exclude nexthop in route
>> lookup.
>
> I actually played around with a bunch of different names and this was as
> short as I could get it while still conveying the point. I'm getting
> ready to submit v3, so speak now if you really do hate the name proposed
> in v2.
>
>> route_exclude_if_linkdown_nexthop_dev?
>
> Not bad, but you see what I mean about it being a mouthful. :)

I think I'd rather see clarity over brevity in this case. The current
name makes the attr sound like a route property but it's really a prop
of the dev when the dev is a nexthop in a route and the dev link is
down.
* [PATCH iproute2 3/3 v2] add support to print 'linkdown' nexthop flag
2015-06-10 6:47 [PATCH net-next 0/3 v2] changes to make ipv4 routing table aware of next-hop link status Andy Gospodarek
2015-06-10 6:47 ` [PATCH net-next 1/3 v2] net: track link-status of ipv4 nexthops Andy Gospodarek
2015-06-10 6:47 ` [PATCH net-next 2/3 v2] net: ipv4 sysctl option to ignore routes when nexthop link is down Andy Gospodarek
@ 2015-06-10 6:47 ` Andy Gospodarek
2 siblings, 0 replies; 12+ messages in thread
From: Andy Gospodarek @ 2015-06-10 6:47 UTC (permalink / raw)
To: netdev, davem, ddutt, sfeldma, alexander.duyck, hannes, stephen
Cc: Andy Gospodarek

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
---
 ip/iproute.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/ip/iproute.c b/ip/iproute.c
index 3795baf..3369c49 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -451,6 +451,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 		fprintf(fp, "offload ");
 	if (r->rtm_flags & RTM_F_NOTIFY)
 		fprintf(fp, "notify ");
+	if (r->rtm_flags & RTNH_F_LINKDOWN)
+		fprintf(fp, "linkdown ");
 	if (tb[RTA_MARK]) {
 		unsigned int mark = *(unsigned int*)RTA_DATA(tb[RTA_MARK]);
 		if (mark) {
@@ -670,6 +672,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 				fprintf(fp, " onlink");
 			if (nh->rtnh_flags & RTNH_F_PERVASIVE)
 				fprintf(fp, " pervasive");
+			if (nh->rtnh_flags & RTNH_F_LINKDOWN)
+				fprintf(fp, " linkdown");
 			len -= NLMSG_ALIGN(nh->rtnh_len);
 			nh = RTNH_NEXT(nh);
 		}
--
1.9.3
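
[Editorial sketch, not part of the thread] The patch above adds one more
"test a flag bit, print a keyword" step to the nexthop printing code. A
standalone model of that pattern, writing into a buffer instead of a FILE so
the result can be checked, might look like the following. The SKETCH_*
values and the sketch_* function names are assumptions for illustration, and
the keyword order is chosen for this sketch; this is not the iproute2 code.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in flag values for illustration only (assumption). */
#define SKETCH_RTNH_F_DEAD     0x01u
#define SKETCH_RTNH_F_ONLINK   0x04u
#define SKETCH_RTNH_F_LINKDOWN 0x10u

/* Append " word" to buf. Returns the new length, or len once buf is full. */
static size_t sketch_append(char *buf, size_t len, size_t used,
			    const char *word)
{
	int n;

	if (used >= len)
		return len;
	n = snprintf(buf + used, len - used, " %s", word);
	if (n < 0)
		return used;		/* encoding error: leave buf as it was */
	if ((size_t)n >= len - used)
		return len;		/* output was truncated */
	return used + (size_t)n;
}

/* Hypothetical helper: render nexthop flags as space-prefixed keywords. */
static void sketch_format_nh_flags(unsigned int flags, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return;
	buf[0] = '\0';

	if (flags & SKETCH_RTNH_F_DEAD)
		used = sketch_append(buf, len, used, "dead");
	if (flags & SKETCH_RTNH_F_ONLINK)
		used = sketch_append(buf, len, used, "onlink");
	if (flags & SKETCH_RTNH_F_LINKDOWN)
		used = sketch_append(buf, len, used, "linkdown");
}
```

With both the dead and linkdown bits set, this produces the same
"dead linkdown" suffix seen in the 'ip route show' output quoted earlier in
the thread.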