* [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing @ 2015-09-23 19:49 Peter Nørlund 2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Peter Nørlund @ 2015-09-23 19:49 UTC (permalink / raw) To: netdev Cc: David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Peter Nørlund When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed from more or less being destination-based into being quasi-random per-packet scheduling. This increases the risk of out-of-order packets and makes it impossible to use multipath together with anycast services. This patch series replaces the old implementation with flow-based load balancing based on a hash over the source and destination addresses. Distribution of the hash is done with thresholds as described in RFC 2992. This reduces the disruption when a path is added/removed when there are more than two paths. To further the chance of successful usage in conjunction with anycast, ICMP error packets are hashed over the inner IP addresses. This ensures that PMTU will work together with anycast or load-balancers such as IPVS. Port numbers are not considered since fragments could cause problems with anycast and IPVS. Relying on the DF-flag for TCP packets is also insufficient, since ICMP inspection effectively extracts information from the opposite flow which might have a different state of the DF-flag. This is also why the RSS hash is not used. These are typically based on the NDIS RSS spec which mandates TCP support. Measurements of the additional overhead of a two-path multipath (ip_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz): Original per-packet: ~394 cycles/packet L3 hash: ~76 cycles/packet Changes in v4: - Functions take hash directly instead of func ptr - Added inline hash function - Added dummy macros to minimize ifdefs - Use upper 31 bits of hash instead of lower Changes in v3: - Multipath algorithm is no longer configurable (always L3) - Added random seed to hash - Moved ICMP inspection to isolated function - Ignore source quench packets (deprecated as per RFC 6633) Changes in v2: - Replaced 8-bit xor hash with 31-bit jenkins hash - Don't scale weights (since 31-bit) - Avoided unnecessary renaming of variables - Rely on DF-bit instead of fragment offset when checking for fragmentation - upper_bound is now inclusive to avoid overflow - Use a callback to postpone extracting flow information until necessary - Skipped ICMP inspection entirely with L4 hashing - Handle newly added sysctl ignore_routes_with_linkdown Best Regards Peter Nørlund Peter Nørlund (2): ipv4: L3 hash-based multipath ipv4: ICMP packet inspection for multipath include/net/ip_fib.h | 14 ++++- include/net/route.h | 11 +++- net/ipv4/fib_semantics.c | 140 ++++++++++++++++++++++-------------------- net/ipv4/icmp.c | 19 +++++- net/ipv4/route.c | 63 +++++++++++++++++-- 5 files changed, 172 insertions(+), 75 deletions(-) ^ permalink raw reply [flat|nested] 16+ messages in thread
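For readers who want to see the threshold scheme in isolation, here is a minimal standalone sketch of weighted hash-threshold selection. The upper-bound formula mirrors the fib_rebalance() introduced in patch 1/2; the struct, the helper names, the weights and the sample hash value are made up for this illustration and are not part of the series.

/* Standalone sketch of weighted hash-threshold nexthop selection.
 * The upper-bound formula matches fib_rebalance() in patch 1/2; the
 * struct, the weights and the sample hash are made up for the example.
 */
#include <stdio.h>
#include <stdint.h>

struct path {
	int weight;
	int dead;		/* stands in for RTNH_F_DEAD/RTNH_F_LINKDOWN */
	int64_t upper_bound;	/* inclusive; -1 when the path is unusable */
};

static void rebalance(struct path *p, int n)
{
	int64_t total = 0, w = 0;
	int i;

	for (i = 0; i < n; i++)
		if (!p[i].dead)
			total += p[i].weight;

	for (i = 0; i < n; i++) {
		if (p[i].dead) {
			p[i].upper_bound = -1;
			continue;
		}
		w += p[i].weight;
		/* DIV_ROUND_CLOSEST(2147483648LL * w, total) - 1 */
		p[i].upper_bound = (2147483648LL * w + total / 2) / total - 1;
	}
}

static int select_path(const struct path *p, int n, int hash)
{
	int i;

	for (i = 0; i < n; i++)
		if (hash <= p[i].upper_bound)
			return i;
	return 0;	/* all paths dead: fall back to the first one */
}

int main(void)
{
	struct path p[3] = { { 2, 0, 0 }, { 1, 0, 0 }, { 1, 0, 0 } };
	int hash = 0x2a2a2a2a >> 1;	/* stand-in for jhash_2words(saddr, daddr, secret) >> 1 */

	rebalance(p, 3);
	printf("hash %d -> nexthop %d\n", hash, select_path(p, 3, hash));
	return 0;
}

With weights 2/1/1 the first nexthop owns the lower half of the 31-bit hash space, so roughly half of all flows map to it; a dead nexthop simply gets an upper bound of -1 and can never match the non-negative hash.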
* [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 19:49 [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing Peter Nørlund @ 2015-09-23 19:49 ` Peter Nørlund 2015-09-23 21:00 ` Tom Herbert 2015-09-23 21:55 ` David Miller 2015-09-23 19:49 ` [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath Peter Nørlund 2015-09-29 2:33 ` [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing David Miller 2 siblings, 2 replies; 16+ messages in thread From: Peter Nørlund @ 2015-09-23 19:49 UTC (permalink / raw) To: netdev Cc: David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Peter Nørlund Replaces the per-packet multipath with a hash-based multipath using source and destination address. Signed-off-by: Peter Nørlund <pch@ordbogen.com> --- include/net/ip_fib.h | 14 ++++- net/ipv4/fib_semantics.c | 140 +++++++++++++++++++++++++--------------------- net/ipv4/route.c | 16 ++++-- 3 files changed, 98 insertions(+), 72 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index a37d043..c454203 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -79,7 +79,7 @@ struct fib_nh { unsigned char nh_scope; #ifdef CONFIG_IP_ROUTE_MULTIPATH int nh_weight; - int nh_power; + atomic_t nh_upper_bound; #endif #ifdef CONFIG_IP_ROUTE_CLASSID __u32 nh_tclassid; @@ -118,7 +118,7 @@ struct fib_info { #define fib_advmss fib_metrics[RTAX_ADVMSS-1] int fib_nhs; #ifdef CONFIG_IP_ROUTE_MULTIPATH - int fib_power; + int fib_weight; #endif struct rcu_head rcu; struct fib_nh fib_nh[0]; @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev); int fib_sync_down_dev(struct net_device *dev, unsigned long event); int fib_sync_down_addr(struct net *net, __be32 local); int fib_sync_up(struct net_device *dev, unsigned int nh_flags); -void fib_select_multipath(struct fib_result *res); + +extern u32 fib_multipath_secret __read_mostly; + +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) +{ + return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1; +} + +void fib_select_multipath(struct fib_result *res, int hash); /* Exported by fib_trie.c */ void fib_trie_init(void); diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 064bd3c..0c49d2f 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt; static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE]; #ifdef CONFIG_IP_ROUTE_MULTIPATH - -static DEFINE_SPINLOCK(fib_multipath_lock); +u32 fib_multipath_secret __read_mostly; #define for_nexthops(fi) { \ int nhsel; const struct fib_nh *nh; \ @@ -532,7 +531,67 @@ errout: return ret; } -#endif +static void fib_rebalance(struct fib_info *fi) +{ + int total; + int w; + struct in_device *in_dev; + + if (fi->fib_nhs < 2) + return; + + total = 0; + for_nexthops(fi) { + if (nh->nh_flags & RTNH_F_DEAD) + continue; + + in_dev = __in_dev_get_rcu(nh->nh_dev); + + if (in_dev && + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + nh->nh_flags & RTNH_F_LINKDOWN) + continue; + + total += nh->nh_weight; + } endfor_nexthops(fi); + + w = 0; + change_nexthops(fi) { + int upper_bound; + + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); + + if (nexthop_nh->nh_flags & RTNH_F_DEAD) { + upper_bound = -1; + } else if (in_dev && + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && + nexthop_nh->nh_flags & RTNH_F_LINKDOWN) { + upper_bound = -1; + } else { + w += nexthop_nh->nh_weight; + upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w, + total) - 1; + } + + 
atomic_set(&nexthop_nh->nh_upper_bound, upper_bound); + } endfor_nexthops(fi); + + net_get_random_once(&fib_multipath_secret, + sizeof(fib_multipath_secret)); +} + +static inline void fib_add_weight(struct fib_info *fi, + const struct fib_nh *nh) +{ + fi->fib_weight += nh->nh_weight; +} + +#else /* CONFIG_IP_ROUTE_MULTIPATH */ + +#define fib_rebalance(fi) do { } while (0) +#define fib_add_weight(fi, nh) do { } while (0) + +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ static int fib_encap_match(struct net *net, u16 encap_type, struct nlattr *encap, @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct fib_config *cfg) change_nexthops(fi) { fib_info_update_nh_saddr(net, nexthop_nh); + fib_add_weight(fi, nexthop_nh); } endfor_nexthops(fi) + fib_rebalance(fi); + link_it: ofi = fib_find_info(fi); if (ofi) { @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) nexthop_nh->nh_flags |= RTNH_F_LINKDOWN; break; } -#ifdef CONFIG_IP_ROUTE_MULTIPATH - spin_lock_bh(&fib_multipath_lock); - fi->fib_power -= nexthop_nh->nh_power; - nexthop_nh->nh_power = 0; - spin_unlock_bh(&fib_multipath_lock); -#endif dead++; } #ifdef CONFIG_IP_ROUTE_MULTIPATH @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) } ret++; } + + fib_rebalance(fi); } return ret; @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags) !__in_dev_get_rtnl(dev)) continue; alive++; -#ifdef CONFIG_IP_ROUTE_MULTIPATH - spin_lock_bh(&fib_multipath_lock); - nexthop_nh->nh_power = 0; - nexthop_nh->nh_flags &= ~nh_flags; - spin_unlock_bh(&fib_multipath_lock); -#else nexthop_nh->nh_flags &= ~nh_flags; -#endif } endfor_nexthops(fi) if (alive > 0) { fi->fib_flags &= ~nh_flags; ret++; } + + fib_rebalance(fi); } return ret; @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags) #ifdef CONFIG_IP_ROUTE_MULTIPATH -/* - * The algorithm is suboptimal, but it provides really - * fair weighted route distribution. - */ -void fib_select_multipath(struct fib_result *res) +void fib_select_multipath(struct fib_result *res, int hash) { struct fib_info *fi = res->fi; - struct in_device *in_dev; - int w; - - spin_lock_bh(&fib_multipath_lock); - if (fi->fib_power <= 0) { - int power = 0; - change_nexthops(fi) { - in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); - if (nexthop_nh->nh_flags & RTNH_F_DEAD) - continue; - if (in_dev && - IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && - nexthop_nh->nh_flags & RTNH_F_LINKDOWN) - continue; - power += nexthop_nh->nh_weight; - nexthop_nh->nh_power = nexthop_nh->nh_weight; - } endfor_nexthops(fi); - fi->fib_power = power; - if (power <= 0) { - spin_unlock_bh(&fib_multipath_lock); - /* Race condition: route has just become dead. */ - res->nh_sel = 0; - return; - } - } - - /* w should be random number [0..fi->fib_power-1], - * it is pretty bad approximation. - */ - - w = jiffies % fi->fib_power; + for_nexthops(fi) { + if (hash > atomic_read(&nh->nh_upper_bound)) + continue; - change_nexthops(fi) { - if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) && - nexthop_nh->nh_power) { - w -= nexthop_nh->nh_power; - if (w <= 0) { - nexthop_nh->nh_power--; - fi->fib_power--; - res->nh_sel = nhsel; - spin_unlock_bh(&fib_multipath_lock); - return; - } - } + res->nh_sel = nhsel; + return; } endfor_nexthops(fi); /* Race condition: route has just become dead. 
*/ res->nh_sel = 0; - spin_unlock_bh(&fib_multipath_lock); } #endif diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 80f7c5b..3dc83b6 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff *skb, __be32 daddr, __be32 saddr, u32 tos) { #ifdef CONFIG_IP_ROUTE_MULTIPATH - if (res->fi && res->fi->fib_nhs > 1) - fib_select_multipath(res); + if (res->fi && res->fi->fib_nhs > 1) { + int h; + + h = fib_multipath_hash(saddr, daddr); + fib_select_multipath(res, h); + } #endif /* create a routing cache entry */ @@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) } #ifdef CONFIG_IP_ROUTE_MULTIPATH - if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) - fib_select_multipath(&res); + if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { + int h; + + h = fib_multipath_hash(fl4->saddr, fl4->daddr); + fib_select_multipath(&res, h); + } else #endif if (!res.prefixlen && -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund @ 2015-09-23 21:00 ` Tom Herbert 2015-09-23 21:43 ` Peter Nørlund 2015-09-23 23:09 ` Peter Nørlund 1 sibling, 2 replies; 16+ messages in thread From: Tom Herbert @ 2015-09-23 21:00 UTC (permalink / raw) To: Peter Nørlund Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> wrote: > Replaces the per-packet multipath with a hash-based multipath using > source and destination address. > It's good that round robin is going away, but this still looks very different from how multipath routing is done in IPv6 (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4 hashes addresses, but IPv6 includes ports. How can we rectify this? > Signed-off-by: Peter Nørlund <pch@ordbogen.com> > --- > include/net/ip_fib.h | 14 ++++- > net/ipv4/fib_semantics.c | 140 +++++++++++++++++++++++++--------------------- > net/ipv4/route.c | 16 ++++-- > 3 files changed, 98 insertions(+), 72 deletions(-) > > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h > index a37d043..c454203 100644 > --- a/include/net/ip_fib.h > +++ b/include/net/ip_fib.h > @@ -79,7 +79,7 @@ struct fib_nh { > unsigned char nh_scope; > #ifdef CONFIG_IP_ROUTE_MULTIPATH > int nh_weight; > - int nh_power; > + atomic_t nh_upper_bound; > #endif > #ifdef CONFIG_IP_ROUTE_CLASSID > __u32 nh_tclassid; > @@ -118,7 +118,7 @@ struct fib_info { > #define fib_advmss fib_metrics[RTAX_ADVMSS-1] > int fib_nhs; > #ifdef CONFIG_IP_ROUTE_MULTIPATH > - int fib_power; > + int fib_weight; > #endif > struct rcu_head rcu; > struct fib_nh fib_nh[0]; > @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev); > int fib_sync_down_dev(struct net_device *dev, unsigned long event); > int fib_sync_down_addr(struct net *net, __be32 local); > int fib_sync_up(struct net_device *dev, unsigned int nh_flags); > -void fib_select_multipath(struct fib_result *res); > + > +extern u32 fib_multipath_secret __read_mostly; > + > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) > +{ > + return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1; Why the >> 1?
> +} > + > +void fib_select_multipath(struct fib_result *res, int hash); > > /* Exported by fib_trie.c */ > void fib_trie_init(void); > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > index 064bd3c..0c49d2f 100644 > --- a/net/ipv4/fib_semantics.c > +++ b/net/ipv4/fib_semantics.c > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt; > static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE]; > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > - > -static DEFINE_SPINLOCK(fib_multipath_lock); > +u32 fib_multipath_secret __read_mostly; > > #define for_nexthops(fi) { \ > int nhsel; const struct fib_nh *nh; \ > @@ -532,7 +531,67 @@ errout: > return ret; > } > > -#endif > +static void fib_rebalance(struct fib_info *fi) > +{ > + int total; > + int w; > + struct in_device *in_dev; > + > + if (fi->fib_nhs < 2) > + return; > + > + total = 0; > + for_nexthops(fi) { > + if (nh->nh_flags & RTNH_F_DEAD) > + continue; > + > + in_dev = __in_dev_get_rcu(nh->nh_dev); > + > + if (in_dev && > + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > + nh->nh_flags & RTNH_F_LINKDOWN) > + continue; > + > + total += nh->nh_weight; > + } endfor_nexthops(fi); > + > + w = 0; > + change_nexthops(fi) { > + int upper_bound; > + > + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); > + > + if (nexthop_nh->nh_flags & RTNH_F_DEAD) { > + upper_bound = -1; > + } else if (in_dev && > + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > + nexthop_nh->nh_flags & RTNH_F_LINKDOWN) { > + upper_bound = -1; > + } else { > + w += nexthop_nh->nh_weight; > + upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w, > + total) - 1; > + } > + Is this complexity (aside from recomputing the key) needed because you want consistent hashing for the anycast case? > + atomic_set(&nexthop_nh->nh_upper_bound, upper_bound); > + } endfor_nexthops(fi); > + > + net_get_random_once(&fib_multipath_secret, > + sizeof(fib_multipath_secret)); > +} > + > +static inline void fib_add_weight(struct fib_info *fi, > + const struct fib_nh *nh) > +{ > + fi->fib_weight += nh->nh_weight; > +} > + > +#else /* CONFIG_IP_ROUTE_MULTIPATH */ > + > +#define fib_rebalance(fi) do { } while (0) > +#define fib_add_weight(fi, nh) do { } while (0) > + > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ > > static int fib_encap_match(struct net *net, u16 encap_type, > struct nlattr *encap, > @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct fib_config *cfg) > > change_nexthops(fi) { > fib_info_update_nh_saddr(net, nexthop_nh); > + fib_add_weight(fi, nexthop_nh); > } endfor_nexthops(fi) > > + fib_rebalance(fi); > + > link_it: > ofi = fib_find_info(fi); > if (ofi) { > @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) > nexthop_nh->nh_flags |= RTNH_F_LINKDOWN; > break; > } > -#ifdef CONFIG_IP_ROUTE_MULTIPATH > - spin_lock_bh(&fib_multipath_lock); > - fi->fib_power -= nexthop_nh->nh_power; > - nexthop_nh->nh_power = 0; > - spin_unlock_bh(&fib_multipath_lock); > -#endif > dead++; > } > #ifdef CONFIG_IP_ROUTE_MULTIPATH > @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) > } > ret++; > } > + > + fib_rebalance(fi); > } > > return ret; > @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags) > !__in_dev_get_rtnl(dev)) > continue; > alive++; > -#ifdef CONFIG_IP_ROUTE_MULTIPATH > - spin_lock_bh(&fib_multipath_lock); > - nexthop_nh->nh_power = 0; > - nexthop_nh->nh_flags &= ~nh_flags; > - spin_unlock_bh(&fib_multipath_lock); > -#else > nexthop_nh->nh_flags &= ~nh_flags; > -#endif > } 
endfor_nexthops(fi) > > if (alive > 0) { > fi->fib_flags &= ~nh_flags; > ret++; > } > + > + fib_rebalance(fi); > } > > return ret; > @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags) > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > -/* > - * The algorithm is suboptimal, but it provides really > - * fair weighted route distribution. > - */ > -void fib_select_multipath(struct fib_result *res) > +void fib_select_multipath(struct fib_result *res, int hash) > { > struct fib_info *fi = res->fi; > - struct in_device *in_dev; > - int w; > - > - spin_lock_bh(&fib_multipath_lock); > - if (fi->fib_power <= 0) { > - int power = 0; > - change_nexthops(fi) { > - in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); > - if (nexthop_nh->nh_flags & RTNH_F_DEAD) > - continue; > - if (in_dev && > - IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > - nexthop_nh->nh_flags & RTNH_F_LINKDOWN) > - continue; > - power += nexthop_nh->nh_weight; > - nexthop_nh->nh_power = nexthop_nh->nh_weight; > - } endfor_nexthops(fi); > - fi->fib_power = power; > - if (power <= 0) { > - spin_unlock_bh(&fib_multipath_lock); > - /* Race condition: route has just become dead. */ > - res->nh_sel = 0; > - return; > - } > - } > - > > - /* w should be random number [0..fi->fib_power-1], > - * it is pretty bad approximation. > - */ > - > - w = jiffies % fi->fib_power; > + for_nexthops(fi) { > + if (hash > atomic_read(&nh->nh_upper_bound)) > + continue; > > - change_nexthops(fi) { > - if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) && > - nexthop_nh->nh_power) { > - w -= nexthop_nh->nh_power; > - if (w <= 0) { > - nexthop_nh->nh_power--; > - fi->fib_power--; > - res->nh_sel = nhsel; > - spin_unlock_bh(&fib_multipath_lock); > - return; > - } > - } > + res->nh_sel = nhsel; > + return; > } endfor_nexthops(fi); > > /* Race condition: route has just become dead. */ > res->nh_sel = 0; > - spin_unlock_bh(&fib_multipath_lock); > } > #endif > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 80f7c5b..3dc83b6 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff *skb, > __be32 daddr, __be32 saddr, u32 tos) > { > #ifdef CONFIG_IP_ROUTE_MULTIPATH > - if (res->fi && res->fi->fib_nhs > 1) > - fib_select_multipath(res); > + if (res->fi && res->fi->fib_nhs > 1) { > + int h; > + > + h = fib_multipath_hash(saddr, daddr); > + fib_select_multipath(res, h); > + } > #endif > > /* create a routing cache entry */ > @@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) > } > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > - if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) > - fib_select_multipath(&res); > + if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { > + int h; > + > + h = fib_multipath_hash(fl4->saddr, fl4->daddr); > + fib_select_multipath(&res, h); > + } > else > #endif > if (!res.prefixlen && > -- > 1.7.10.4 > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 21:00 ` Tom Herbert @ 2015-09-23 21:43 ` Peter Nørlund 2015-09-23 23:12 ` Tom Herbert 2015-09-23 23:09 ` Peter Nørlund 1 sibling, 1 reply; 16+ messages in thread From: Peter Nørlund @ 2015-09-23 21:43 UTC (permalink / raw) To: Tom Herbert Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, 23 Sep 2015 14:00:43 -0700 Tom Herbert <tom@herbertland.com> wrote: > On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> > wrote: > > Replaces the per-packet multipath with a hash-based multipath using > > source and destination address. > > > It's good that round robin is going away, but this still looks very > different with how multipath routing is done done in IPv6 > (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4 > hashes addresses, but IPv6 includes ports. How can we rectify this? > > > Signed-off-by: Peter Nørlund <pch@ordbogen.com> > > --- > > include/net/ip_fib.h | 14 ++++- > > net/ipv4/fib_semantics.c | 140 > > +++++++++++++++++++++++++--------------------- > > net/ipv4/route.c | 16 ++++-- 3 files changed, 98 > > insertions(+), 72 deletions(-) > > > > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h > > index a37d043..c454203 100644 > > --- a/include/net/ip_fib.h > > +++ b/include/net/ip_fib.h > > @@ -79,7 +79,7 @@ struct fib_nh { > > unsigned char nh_scope; > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > int nh_weight; > > - int nh_power; > > + atomic_t nh_upper_bound; > > #endif > > #ifdef CONFIG_IP_ROUTE_CLASSID > > __u32 nh_tclassid; > > @@ -118,7 +118,7 @@ struct fib_info { > > #define fib_advmss fib_metrics[RTAX_ADVMSS-1] > > int fib_nhs; > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > - int fib_power; > > + int fib_weight; > > #endif > > struct rcu_head rcu; > > struct fib_nh fib_nh[0]; > > @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct > > net_device *dev); int fib_sync_down_dev(struct net_device *dev, > > unsigned long event); int fib_sync_down_addr(struct net *net, > > __be32 local); int fib_sync_up(struct net_device *dev, unsigned int > > nh_flags); -void fib_select_multipath(struct fib_result *res); > > + > > +extern u32 fib_multipath_secret __read_mostly; > > + > > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) > > +{ > > + return jhash_2words(saddr, daddr, fib_multipath_secret) >> > > 1; > > Why the >> 1? Each nexthop is assigned a range within the hash key space with the upper bound specified in nh_upper_bound, but I needed a way to show that a nexthop was dead within that same atomic variable. I chose -1 reducing the actual hash to 31 bits. I shift the hash because I expect it to be slightly faster than AND'ing with 0x7FFFFFFF on most platforms. 
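Put differently, keeping the hash to 31 bits lets one signed value encode both an inclusive range bound and the "dead" marker, so the selection loop needs no separate liveness check. A toy illustration of that invariant (the values and variable names here are made up for the example):

/* Why a 31-bit hash lets -1 double as a "dead" marker: the hash is
 * always in [0, 2^31 - 1], so it can never fall at or below -1 and a
 * dead nexthop is skipped by the same comparison used for live ones.
 */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t full = 0xdeadbeef;		/* pretend jhash_2words() output */
	int hash = full >> 1;			/* 31 bits, always non-negative */
	int dead_upper_bound = -1;
	int last_upper_bound = 0x7fffffff;	/* last live nexthop covers the rest */

	assert(hash >= 0);
	assert(hash > dead_upper_bound);	/* a dead path is never selected */
	assert(hash <= last_upper_bound);	/* some live path always matches */
	return 0;
}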
> > > +} > > + > > +void fib_select_multipath(struct fib_result *res, int hash); > > > > /* Exported by fib_trie.c */ > > void fib_trie_init(void); > > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > > index 064bd3c..0c49d2f 100644 > > --- a/net/ipv4/fib_semantics.c > > +++ b/net/ipv4/fib_semantics.c > > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt; > > static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE]; > > > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > - > > -static DEFINE_SPINLOCK(fib_multipath_lock); > > +u32 fib_multipath_secret __read_mostly; > > > > #define for_nexthops(fi) > > { \ int nhsel; const > > struct fib_nh *nh; \ @@ -532,7 +531,67 > > @@ errout: return ret; > > } > > > > -#endif > > +static void fib_rebalance(struct fib_info *fi) > > +{ > > + int total; > > + int w; > > + struct in_device *in_dev; > > + > > + if (fi->fib_nhs < 2) > > + return; > > + > > + total = 0; > > + for_nexthops(fi) { > > + if (nh->nh_flags & RTNH_F_DEAD) > > + continue; > > + > > + in_dev = __in_dev_get_rcu(nh->nh_dev); > > + > > + if (in_dev && > > + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > > + nh->nh_flags & RTNH_F_LINKDOWN) > > + continue; > > + > > + total += nh->nh_weight; > > + } endfor_nexthops(fi); > > + > > + w = 0; > > + change_nexthops(fi) { > > + int upper_bound; > > + > > + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); > > + > > + if (nexthop_nh->nh_flags & RTNH_F_DEAD) { > > + upper_bound = -1; > > + } else if (in_dev && > > + > > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > > + nexthop_nh->nh_flags & RTNH_F_LINKDOWN) { > > + upper_bound = -1; > > + } else { > > + w += nexthop_nh->nh_weight; > > + upper_bound = > > DIV_ROUND_CLOSEST(2147483648LL * w, > > + total) - 1; > > + } > > + > > Is this complexity (aside from recomputing the key) needed because you > want consistent hashing for the anycast case? It is an attempt to reduce the disruption when a path is added or removed in a way recommended by RFC 2992. It is not specific to anycast at all, although disruptions are more severe for anycast than for ordinary TCP which might just slow down due to out of order packets. Traditionally people used modulo to calculate the path from the hash, but with the modulo approach 1/2 of all flows would change paths when a path is added or removed. With the hash threshold approach, somewhere between 1/4 and 1/2 of all flows are affected. So an improvement for routes with more than two paths. Ideally you would have a hash-to-port mapping table or similar, ensuring that only 1/N of all flows were affected (with N being the number of paths). 
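A rough standalone simulation of that difference, reusing the same threshold computation as fib_rebalance() and a toy PRNG in place of jhash over real flows; the path count, weights, helper names and PRNG are made up for the example, and the exact percentages depend on how many paths there are and which one disappears, so the numbers are only illustrative:

/* Rough simulation of how many flows change nexthop when one of four
 * equal-weight paths goes away, comparing naive modulo mapping with
 * hash-threshold mapping.  Illustrative only.
 */
#include <stdio.h>
#include <stdint.h>

#define NPATHS	4
#define FLOWS	1000000

/* Inclusive upper bounds computed as in fib_rebalance(); a weight of 0
 * plays the role of a dead nexthop (-1 bound). */
static void rebalance(const int *weight, int64_t *bound)
{
	int64_t total = 0, w = 0;
	int i;

	for (i = 0; i < NPATHS; i++)
		total += weight[i];

	for (i = 0; i < NPATHS; i++) {
		if (!weight[i]) {
			bound[i] = -1;
			continue;
		}
		w += weight[i];
		bound[i] = (2147483648LL * w + total / 2) / total - 1;
	}
}

static int by_threshold(uint32_t h, const int64_t *bound)
{
	int i;

	for (i = 0; i < NPATHS; i++)
		if ((int64_t)h <= bound[i])
			return i;
	return 0;
}

int main(void)
{
	int w_before[NPATHS] = { 1, 1, 1, 1 };
	int w_after[NPATHS]  = { 1, 0, 1, 1 };	/* second path removed */
	int64_t b_before[NPATHS], b_after[NPATHS];
	uint32_t state = 1;
	unsigned int moved_mod = 0, moved_thr = 0, f;

	rebalance(w_before, b_before);
	rebalance(w_after, b_after);

	for (f = 0; f < FLOWS; f++) {
		uint32_t h;

		state = state * 1664525U + 1013904223U;	/* toy PRNG */
		h = state >> 1;				/* 31-bit "flow hash" */

		if (h % 4 != h % 3)
			moved_mod++;
		if (by_threshold(h, b_before) != by_threshold(h, b_after))
			moved_thr++;
	}

	printf("modulo:    %.1f%% of flows moved\n", 100.0 * moved_mod / FLOWS);
	printf("threshold: %.1f%% of flows moved\n", 100.0 * moved_thr / FLOWS);
	return 0;
}

For this particular configuration (four equal-weight paths, second one removed) the modulo mapping moves roughly 75% of the flows while the threshold mapping moves roughly 33%.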
> > > + atomic_set(&nexthop_nh->nh_upper_bound, > > upper_bound); > > + } endfor_nexthops(fi); > > + > > + net_get_random_once(&fib_multipath_secret, > > + sizeof(fib_multipath_secret)); > > +} > > + > > +static inline void fib_add_weight(struct fib_info *fi, > > + const struct fib_nh *nh) > > +{ > > + fi->fib_weight += nh->nh_weight; > > +} > > + > > +#else /* CONFIG_IP_ROUTE_MULTIPATH */ > > + > > +#define fib_rebalance(fi) do { } while (0) > > +#define fib_add_weight(fi, nh) do { } while (0) > > + > > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ > > > > static int fib_encap_match(struct net *net, u16 encap_type, > > struct nlattr *encap, > > @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct > > fib_config *cfg) > > > > change_nexthops(fi) { > > fib_info_update_nh_saddr(net, nexthop_nh); > > + fib_add_weight(fi, nexthop_nh); > > } endfor_nexthops(fi) > > > > + fib_rebalance(fi); > > + > > link_it: > > ofi = fib_find_info(fi); > > if (ofi) { > > @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device > > *dev, unsigned long event) nexthop_nh->nh_flags |= RTNH_F_LINKDOWN; > > break; > > } > > -#ifdef CONFIG_IP_ROUTE_MULTIPATH > > - spin_lock_bh(&fib_multipath_lock); > > - fi->fib_power -= > > nexthop_nh->nh_power; > > - nexthop_nh->nh_power = 0; > > - spin_unlock_bh(&fib_multipath_lock); > > -#endif > > dead++; > > } > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev, > > unsigned long event) } > > ret++; > > } > > + > > + fib_rebalance(fi); > > } > > > > return ret; > > @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev, > > unsigned int nh_flags) !__in_dev_get_rtnl(dev)) > > continue; > > alive++; > > -#ifdef CONFIG_IP_ROUTE_MULTIPATH > > - spin_lock_bh(&fib_multipath_lock); > > - nexthop_nh->nh_power = 0; > > - nexthop_nh->nh_flags &= ~nh_flags; > > - spin_unlock_bh(&fib_multipath_lock); > > -#else > > nexthop_nh->nh_flags &= ~nh_flags; > > -#endif > > } endfor_nexthops(fi) > > > > if (alive > 0) { > > fi->fib_flags &= ~nh_flags; > > ret++; > > } > > + > > + fib_rebalance(fi); > > } > > > > return ret; > > @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev, > > unsigned int nh_flags) > > > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > > > -/* > > - * The algorithm is suboptimal, but it provides really > > - * fair weighted route distribution. > > - */ > > -void fib_select_multipath(struct fib_result *res) > > +void fib_select_multipath(struct fib_result *res, int hash) > > { > > struct fib_info *fi = res->fi; > > - struct in_device *in_dev; > > - int w; > > - > > - spin_lock_bh(&fib_multipath_lock); > > - if (fi->fib_power <= 0) { > > - int power = 0; > > - change_nexthops(fi) { > > - in_dev = > > __in_dev_get_rcu(nexthop_nh->nh_dev); > > - if (nexthop_nh->nh_flags & RTNH_F_DEAD) > > - continue; > > - if (in_dev && > > - > > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > > - nexthop_nh->nh_flags & RTNH_F_LINKDOWN) > > - continue; > > - power += nexthop_nh->nh_weight; > > - nexthop_nh->nh_power = > > nexthop_nh->nh_weight; > > - } endfor_nexthops(fi); > > - fi->fib_power = power; > > - if (power <= 0) { > > - spin_unlock_bh(&fib_multipath_lock); > > - /* Race condition: route has just become > > dead. */ > > - res->nh_sel = 0; > > - return; > > - } > > - } > > - > > > > - /* w should be random number [0..fi->fib_power-1], > > - * it is pretty bad approximation. 
> > - */ > > - > > - w = jiffies % fi->fib_power; > > + for_nexthops(fi) { > > + if (hash > atomic_read(&nh->nh_upper_bound)) > > + continue; > > > > - change_nexthops(fi) { > > - if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) && > > - nexthop_nh->nh_power) { > > - w -= nexthop_nh->nh_power; > > - if (w <= 0) { > > - nexthop_nh->nh_power--; > > - fi->fib_power--; > > - res->nh_sel = nhsel; > > - spin_unlock_bh(&fib_multipath_lock); > > - return; > > - } > > - } > > + res->nh_sel = nhsel; > > + return; > > } endfor_nexthops(fi); > > > > /* Race condition: route has just become dead. */ > > res->nh_sel = 0; > > - spin_unlock_bh(&fib_multipath_lock); > > } > > #endif > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > index 80f7c5b..3dc83b6 100644 > > --- a/net/ipv4/route.c > > +++ b/net/ipv4/route.c > > @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff > > *skb, __be32 daddr, __be32 saddr, u32 tos) > > { > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > - if (res->fi && res->fi->fib_nhs > 1) > > - fib_select_multipath(res); > > + if (res->fi && res->fi->fib_nhs > 1) { > > + int h; > > + > > + h = fib_multipath_hash(saddr, daddr); > > + fib_select_multipath(res, h); > > + } > > #endif > > > > /* create a routing cache entry */ > > @@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct > > net *net, struct flowi4 *fl4) } > > > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > - if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) > > - fib_select_multipath(&res); > > + if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { > > + int h; > > + > > + h = fib_multipath_hash(fl4->saddr, fl4->daddr); > > + fib_select_multipath(&res, h); > > + } > > else > > #endif > > if (!res.prefixlen && > > -- > > 1.7.10.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 21:43 ` Peter Nørlund @ 2015-09-23 23:12 ` Tom Herbert 2015-09-23 23:38 ` Peter Nørlund 0 siblings, 1 reply; 16+ messages in thread From: Tom Herbert @ 2015-09-23 23:12 UTC (permalink / raw) To: Peter Nørlund Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, Sep 23, 2015 at 2:43 PM, Peter Nørlund <pch@ordbogen.com> wrote: > On Wed, 23 Sep 2015 14:00:43 -0700 > Tom Herbert <tom@herbertland.com> wrote: > >> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> >> wrote: >> > Replaces the per-packet multipath with a hash-based multipath using >> > source and destination address. >> > >> It's good that round robin is going away, but this still looks very >> different with how multipath routing is done done in IPv6 >> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4 >> hashes addresses, but IPv6 includes ports. How can we rectify this? >> >> > Signed-off-by: Peter Nørlund <pch@ordbogen.com> >> > --- >> > include/net/ip_fib.h | 14 ++++- >> > net/ipv4/fib_semantics.c | 140 >> > +++++++++++++++++++++++++--------------------- >> > net/ipv4/route.c | 16 ++++-- 3 files changed, 98 >> > insertions(+), 72 deletions(-) >> > >> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h >> > index a37d043..c454203 100644 >> > --- a/include/net/ip_fib.h >> > +++ b/include/net/ip_fib.h >> > @@ -79,7 +79,7 @@ struct fib_nh { >> > unsigned char nh_scope; >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH >> > int nh_weight; >> > - int nh_power; >> > + atomic_t nh_upper_bound; >> > #endif >> > #ifdef CONFIG_IP_ROUTE_CLASSID >> > __u32 nh_tclassid; >> > @@ -118,7 +118,7 @@ struct fib_info { >> > #define fib_advmss fib_metrics[RTAX_ADVMSS-1] >> > int fib_nhs; >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH >> > - int fib_power; >> > + int fib_weight; >> > #endif >> > struct rcu_head rcu; >> > struct fib_nh fib_nh[0]; >> > @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct >> > net_device *dev); int fib_sync_down_dev(struct net_device *dev, >> > unsigned long event); int fib_sync_down_addr(struct net *net, >> > __be32 local); int fib_sync_up(struct net_device *dev, unsigned int >> > nh_flags); -void fib_select_multipath(struct fib_result *res); >> > + >> > +extern u32 fib_multipath_secret __read_mostly; >> > + >> > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) >> > +{ >> > + return jhash_2words(saddr, daddr, fib_multipath_secret) >> >> > 1; >> >> Why the >> 1? > > Each nexthop is assigned a range within the hash key space with the > upper bound specified in nh_upper_bound, but I needed a way to show > that a nexthop was dead within that same atomic variable. I chose -1 > reducing the actual hash to 31 bits. I shift the hash because I expect > it to be slightly faster than AND'ing with 0x7FFFFFFF on most platforms. 
> >> >> > +} >> > + >> > +void fib_select_multipath(struct fib_result *res, int hash); >> > >> > /* Exported by fib_trie.c */ >> > void fib_trie_init(void); >> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c >> > index 064bd3c..0c49d2f 100644 >> > --- a/net/ipv4/fib_semantics.c >> > +++ b/net/ipv4/fib_semantics.c >> > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt; >> > static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE]; >> > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH >> > - >> > -static DEFINE_SPINLOCK(fib_multipath_lock); >> > +u32 fib_multipath_secret __read_mostly; >> > >> > #define for_nexthops(fi) >> > { \ int nhsel; const >> > struct fib_nh *nh; \ @@ -532,7 +531,67 >> > @@ errout: return ret; >> > } >> > >> > -#endif >> > +static void fib_rebalance(struct fib_info *fi) >> > +{ >> > + int total; >> > + int w; >> > + struct in_device *in_dev; >> > + >> > + if (fi->fib_nhs < 2) >> > + return; >> > + >> > + total = 0; >> > + for_nexthops(fi) { >> > + if (nh->nh_flags & RTNH_F_DEAD) >> > + continue; >> > + >> > + in_dev = __in_dev_get_rcu(nh->nh_dev); >> > + >> > + if (in_dev && >> > + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && >> > + nh->nh_flags & RTNH_F_LINKDOWN) >> > + continue; >> > + >> > + total += nh->nh_weight; >> > + } endfor_nexthops(fi); >> > + >> > + w = 0; >> > + change_nexthops(fi) { >> > + int upper_bound; >> > + >> > + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); >> > + >> > + if (nexthop_nh->nh_flags & RTNH_F_DEAD) { >> > + upper_bound = -1; >> > + } else if (in_dev && >> > + >> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && >> > + nexthop_nh->nh_flags & RTNH_F_LINKDOWN) { >> > + upper_bound = -1; >> > + } else { >> > + w += nexthop_nh->nh_weight; >> > + upper_bound = >> > DIV_ROUND_CLOSEST(2147483648LL * w, >> > + total) - 1; >> > + } >> > + >> >> Is this complexity (aside from recomputing the key) needed because you >> want consistent hashing for the anycast case? > > It is an attempt to reduce the disruption when a path is added or > removed in a way recommended by RFC 2992. It is not specific to anycast > at all, although disruptions are more severe for anycast than for > ordinary TCP which might just slow down due to out of order packets. > RFC2992 looks more like a mathematical proof than a practical algorithm :-) Anyway, the requirement to avoid OOO with TCP is so often over blown. TCP does not keel over dead with OOO packets, it just may experience a blip of being suboptimal. As long as we're not thrashing paths things should be fine quickly, and if we are thrashing links rather each change affects 1/2 or 1/4 of connections probably doesn't matter. IMO no one should complain about just doing simple modulo on the hash. However, if you're saying that adding a removing a path can cause complete loss of connectivity for 1/2 of connections using anycast that is a little more worrisome (this is essentially the same problem we have with SO_REUSEPORT and TCP). In that case seems like you'd want the 1/N solution. Tom > Traditionally people used modulo to calculate the path from the hash, > but with the modulo approach 1/2 of all flows would change paths when a > path is added or removed. With the hash threshold approach, somewhere > between 1/4 and 1/2 of all flows are affected. So an improvement for > routes with more than two paths. > > Ideally you would have a hash-to-port mapping table or similar, > ensuring that only 1/N of all flows were affected (with N being the > number of paths). 
> >> >> > + atomic_set(&nexthop_nh->nh_upper_bound, >> > upper_bound); >> > + } endfor_nexthops(fi); >> > + >> > + net_get_random_once(&fib_multipath_secret, >> > + sizeof(fib_multipath_secret)); >> > +} >> > + >> > +static inline void fib_add_weight(struct fib_info *fi, >> > + const struct fib_nh *nh) >> > +{ >> > + fi->fib_weight += nh->nh_weight; >> > +} >> > + >> > +#else /* CONFIG_IP_ROUTE_MULTIPATH */ >> > + >> > +#define fib_rebalance(fi) do { } while (0) >> > +#define fib_add_weight(fi, nh) do { } while (0) >> > + >> > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ >> > >> > static int fib_encap_match(struct net *net, u16 encap_type, >> > struct nlattr *encap, >> > @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct >> > fib_config *cfg) >> > >> > change_nexthops(fi) { >> > fib_info_update_nh_saddr(net, nexthop_nh); >> > + fib_add_weight(fi, nexthop_nh); >> > } endfor_nexthops(fi) >> > >> > + fib_rebalance(fi); >> > + >> > link_it: >> > ofi = fib_find_info(fi); >> > if (ofi) { >> > @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device >> > *dev, unsigned long event) nexthop_nh->nh_flags |= RTNH_F_LINKDOWN; >> > break; >> > } >> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH >> > - spin_lock_bh(&fib_multipath_lock); >> > - fi->fib_power -= >> > nexthop_nh->nh_power; >> > - nexthop_nh->nh_power = 0; >> > - spin_unlock_bh(&fib_multipath_lock); >> > -#endif >> > dead++; >> > } >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH >> > @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev, >> > unsigned long event) } >> > ret++; >> > } >> > + >> > + fib_rebalance(fi); >> > } >> > >> > return ret; >> > @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev, >> > unsigned int nh_flags) !__in_dev_get_rtnl(dev)) >> > continue; >> > alive++; >> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH >> > - spin_lock_bh(&fib_multipath_lock); >> > - nexthop_nh->nh_power = 0; >> > - nexthop_nh->nh_flags &= ~nh_flags; >> > - spin_unlock_bh(&fib_multipath_lock); >> > -#else >> > nexthop_nh->nh_flags &= ~nh_flags; >> > -#endif >> > } endfor_nexthops(fi) >> > >> > if (alive > 0) { >> > fi->fib_flags &= ~nh_flags; >> > ret++; >> > } >> > + >> > + fib_rebalance(fi); >> > } >> > >> > return ret; >> > @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev, >> > unsigned int nh_flags) >> > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH >> > >> > -/* >> > - * The algorithm is suboptimal, but it provides really >> > - * fair weighted route distribution. >> > - */ >> > -void fib_select_multipath(struct fib_result *res) >> > +void fib_select_multipath(struct fib_result *res, int hash) >> > { >> > struct fib_info *fi = res->fi; >> > - struct in_device *in_dev; >> > - int w; >> > - >> > - spin_lock_bh(&fib_multipath_lock); >> > - if (fi->fib_power <= 0) { >> > - int power = 0; >> > - change_nexthops(fi) { >> > - in_dev = >> > __in_dev_get_rcu(nexthop_nh->nh_dev); >> > - if (nexthop_nh->nh_flags & RTNH_F_DEAD) >> > - continue; >> > - if (in_dev && >> > - >> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && >> > - nexthop_nh->nh_flags & RTNH_F_LINKDOWN) >> > - continue; >> > - power += nexthop_nh->nh_weight; >> > - nexthop_nh->nh_power = >> > nexthop_nh->nh_weight; >> > - } endfor_nexthops(fi); >> > - fi->fib_power = power; >> > - if (power <= 0) { >> > - spin_unlock_bh(&fib_multipath_lock); >> > - /* Race condition: route has just become >> > dead. */ >> > - res->nh_sel = 0; >> > - return; >> > - } >> > - } >> > - >> > >> > - /* w should be random number [0..fi->fib_power-1], >> > - * it is pretty bad approximation. 
>> > - */ >> > - >> > - w = jiffies % fi->fib_power; >> > + for_nexthops(fi) { >> > + if (hash > atomic_read(&nh->nh_upper_bound)) >> > + continue; >> > >> > - change_nexthops(fi) { >> > - if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) && >> > - nexthop_nh->nh_power) { >> > - w -= nexthop_nh->nh_power; >> > - if (w <= 0) { >> > - nexthop_nh->nh_power--; >> > - fi->fib_power--; >> > - res->nh_sel = nhsel; >> > - spin_unlock_bh(&fib_multipath_lock); >> > - return; >> > - } >> > - } >> > + res->nh_sel = nhsel; >> > + return; >> > } endfor_nexthops(fi); >> > >> > /* Race condition: route has just become dead. */ >> > res->nh_sel = 0; >> > - spin_unlock_bh(&fib_multipath_lock); >> > } >> > #endif >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c >> > index 80f7c5b..3dc83b6 100644 >> > --- a/net/ipv4/route.c >> > +++ b/net/ipv4/route.c >> > @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff >> > *skb, __be32 daddr, __be32 saddr, u32 tos) >> > { >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH >> > - if (res->fi && res->fi->fib_nhs > 1) >> > - fib_select_multipath(res); >> > + if (res->fi && res->fi->fib_nhs > 1) { >> > + int h; >> > + >> > + h = fib_multipath_hash(saddr, daddr); >> > + fib_select_multipath(res, h); >> > + } >> > #endif >> > >> > /* create a routing cache entry */ >> > @@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct >> > net *net, struct flowi4 *fl4) } >> > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH >> > - if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) >> > - fib_select_multipath(&res); >> > + if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { >> > + int h; >> > + >> > + h = fib_multipath_hash(fl4->saddr, fl4->daddr); >> > + fib_select_multipath(&res, h); >> > + } >> > else >> > #endif >> > if (!res.prefixlen && >> > -- >> > 1.7.10.4 >> > >> > -- >> > To unsubscribe from this list: send the line "unsubscribe netdev" in >> > the body of a message to majordomo@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 23:12 ` Tom Herbert @ 2015-09-23 23:38 ` Peter Nørlund 0 siblings, 0 replies; 16+ messages in thread From: Peter Nørlund @ 2015-09-23 23:38 UTC (permalink / raw) To: Tom Herbert Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, 23 Sep 2015 16:12:42 -0700 Tom Herbert <tom@herbertland.com> wrote: > On Wed, Sep 23, 2015 at 2:43 PM, Peter Nørlund <pch@ordbogen.com> > wrote: > > On Wed, 23 Sep 2015 14:00:43 -0700 > > Tom Herbert <tom@herbertland.com> wrote: > > > >> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> > >> wrote: > >> > Replaces the per-packet multipath with a hash-based multipath > >> > using source and destination address. > >> > > >> It's good that round robin is going away, but this still looks very > >> different with how multipath routing is done done in IPv6 > >> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4 > >> hashes addresses, but IPv6 includes ports. How can we rectify this? > >> > >> > Signed-off-by: Peter Nørlund <pch@ordbogen.com> > >> > --- > >> > include/net/ip_fib.h | 14 ++++- > >> > net/ipv4/fib_semantics.c | 140 > >> > +++++++++++++++++++++++++--------------------- > >> > net/ipv4/route.c | 16 ++++-- 3 files changed, 98 > >> > insertions(+), 72 deletions(-) > >> > > >> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h > >> > index a37d043..c454203 100644 > >> > --- a/include/net/ip_fib.h > >> > +++ b/include/net/ip_fib.h > >> > @@ -79,7 +79,7 @@ struct fib_nh { > >> > unsigned char nh_scope; > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > int nh_weight; > >> > - int nh_power; > >> > + atomic_t nh_upper_bound; > >> > #endif > >> > #ifdef CONFIG_IP_ROUTE_CLASSID > >> > __u32 nh_tclassid; > >> > @@ -118,7 +118,7 @@ struct fib_info { > >> > #define fib_advmss fib_metrics[RTAX_ADVMSS-1] > >> > int fib_nhs; > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > - int fib_power; > >> > + int fib_weight; > >> > #endif > >> > struct rcu_head rcu; > >> > struct fib_nh fib_nh[0]; > >> > @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct > >> > net_device *dev); int fib_sync_down_dev(struct net_device *dev, > >> > unsigned long event); int fib_sync_down_addr(struct net *net, > >> > __be32 local); int fib_sync_up(struct net_device *dev, unsigned > >> > int nh_flags); -void fib_select_multipath(struct fib_result > >> > *res); + > >> > +extern u32 fib_multipath_secret __read_mostly; > >> > + > >> > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) > >> > +{ > >> > + return jhash_2words(saddr, daddr, fib_multipath_secret) > >> > >> 1; > >> > >> Why the >> 1? > > > > Each nexthop is assigned a range within the hash key space with the > > upper bound specified in nh_upper_bound, but I needed a way to show > > that a nexthop was dead within that same atomic variable. I chose -1 > > reducing the actual hash to 31 bits. I shift the hash because I > > expect it to be slightly faster than AND'ing with 0x7FFFFFFF on > > most platforms. 
> > > >> > >> > +} > >> > + > >> > +void fib_select_multipath(struct fib_result *res, int hash); > >> > > >> > /* Exported by fib_trie.c */ > >> > void fib_trie_init(void); > >> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > >> > index 064bd3c..0c49d2f 100644 > >> > --- a/net/ipv4/fib_semantics.c > >> > +++ b/net/ipv4/fib_semantics.c > >> > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt; > >> > static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE]; > >> > > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > - > >> > -static DEFINE_SPINLOCK(fib_multipath_lock); > >> > +u32 fib_multipath_secret __read_mostly; > >> > > >> > #define for_nexthops(fi) > >> > { \ int nhsel; const > >> > struct fib_nh *nh; \ @@ -532,7 > >> > +531,67 @@ errout: return ret; > >> > } > >> > > >> > -#endif > >> > +static void fib_rebalance(struct fib_info *fi) > >> > +{ > >> > + int total; > >> > + int w; > >> > + struct in_device *in_dev; > >> > + > >> > + if (fi->fib_nhs < 2) > >> > + return; > >> > + > >> > + total = 0; > >> > + for_nexthops(fi) { > >> > + if (nh->nh_flags & RTNH_F_DEAD) > >> > + continue; > >> > + > >> > + in_dev = __in_dev_get_rcu(nh->nh_dev); > >> > + > >> > + if (in_dev && > >> > + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > >> > + nh->nh_flags & RTNH_F_LINKDOWN) > >> > + continue; > >> > + > >> > + total += nh->nh_weight; > >> > + } endfor_nexthops(fi); > >> > + > >> > + w = 0; > >> > + change_nexthops(fi) { > >> > + int upper_bound; > >> > + > >> > + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev); > >> > + > >> > + if (nexthop_nh->nh_flags & RTNH_F_DEAD) { > >> > + upper_bound = -1; > >> > + } else if (in_dev && > >> > + > >> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > >> > + nexthop_nh->nh_flags & > >> > RTNH_F_LINKDOWN) { > >> > + upper_bound = -1; > >> > + } else { > >> > + w += nexthop_nh->nh_weight; > >> > + upper_bound = > >> > DIV_ROUND_CLOSEST(2147483648LL * w, > >> > + total) - > >> > 1; > >> > + } > >> > + > >> > >> Is this complexity (aside from recomputing the key) needed because > >> you want consistent hashing for the anycast case? > > > > It is an attempt to reduce the disruption when a path is added or > > removed in a way recommended by RFC 2992. It is not specific to > > anycast at all, although disruptions are more severe for anycast > > than for ordinary TCP which might just slow down due to out of > > order packets. > > > RFC2992 looks more like a mathematical proof than a practical > algorithm :-) Anyway, the requirement to avoid OOO with TCP is so > often over blown. TCP does not keel over dead with OOO packets, it > just may experience a blip of being suboptimal. As long as we're not > thrashing paths things should be fine quickly, and if we are thrashing > links rather each change affects 1/2 or 1/4 of connections probably > doesn't matter. IMO no one should complain about just doing simple > modulo on the hash. > > However, if you're saying that adding a removing a path can cause > complete loss of connectivity for 1/2 of connections using anycast > that is a little more worrisome (this is essentially the same problem > we have with SO_REUSEPORT and TCP). In that case seems like you'd want > the 1/N solution. > > Tom > I guess it depends on how you use anycast. If you use TCP and each path would hit different servers which doesn't share state at all, things would break. 
If you're having a pair of load-balancers behind the router which synchronizes with each other (like IPVS), then only the very new connections would fail (the connections which haven't been synchronized yet). If you only have two paths, then you can't get any better than 1/2 no matter what algorithm you choose (all flows on the bad path will have to move, which may cause issues depending on the setup). It is actually still better than IPVS in an active/backup scenario, since a crashing load balancer might cause issues for ALL flows. I guess we are getting closer and closer to L4 with modulo after all - at least it is simpler. Best regards, Peter Nørlund > > Traditionally people used modulo to calculate the path from the > > hash, but with the modulo approach 1/2 of all flows would change > > paths when a path is added or removed. With the hash threshold > > approach, somewhere between 1/4 and 1/2 of all flows are affected. > > So an improvement for routes with more than two paths. > > > > Ideally you would have a hash-to-port mapping table or similar, > > ensuring that only 1/N of all flows were affected (with N being the > > number of paths). > > > >> > >> > + atomic_set(&nexthop_nh->nh_upper_bound, > >> > upper_bound); > >> > + } endfor_nexthops(fi); > >> > + > >> > + net_get_random_once(&fib_multipath_secret, > >> > + sizeof(fib_multipath_secret)); > >> > +} > >> > + > >> > +static inline void fib_add_weight(struct fib_info *fi, > >> > + const struct fib_nh *nh) > >> > +{ > >> > + fi->fib_weight += nh->nh_weight; > >> > +} > >> > + > >> > +#else /* CONFIG_IP_ROUTE_MULTIPATH */ > >> > + > >> > +#define fib_rebalance(fi) do { } while (0) > >> > +#define fib_add_weight(fi, nh) do { } while (0) > >> > + > >> > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ > >> > > >> > static int fib_encap_match(struct net *net, u16 encap_type, > >> > struct nlattr *encap, > >> > @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct > >> > fib_config *cfg) > >> > > >> > change_nexthops(fi) { > >> > fib_info_update_nh_saddr(net, nexthop_nh); > >> > + fib_add_weight(fi, nexthop_nh); > >> > } endfor_nexthops(fi) > >> > > >> > + fib_rebalance(fi); > >> > + > >> > link_it: > >> > ofi = fib_find_info(fi); > >> > if (ofi) { > >> > @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device > >> > *dev, unsigned long event) nexthop_nh->nh_flags |= > >> > RTNH_F_LINKDOWN; break; > >> > } > >> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > - > >> > spin_lock_bh(&fib_multipath_lock); > >> > - fi->fib_power -= > >> > nexthop_nh->nh_power; > >> > - nexthop_nh->nh_power = 0; > >> > - > >> > spin_unlock_bh(&fib_multipath_lock); -#endif > >> > dead++; > >> > } > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device > >> > *dev, unsigned long event) } > >> > ret++; > >> > } > >> > + > >> > + fib_rebalance(fi); > >> > } > >> > > >> > return ret; > >> > @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev, > >> > unsigned int nh_flags) !__in_dev_get_rtnl(dev)) > >> > continue; > >> > alive++; > >> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > - spin_lock_bh(&fib_multipath_lock); > >> > - nexthop_nh->nh_power = 0; > >> > - nexthop_nh->nh_flags &= ~nh_flags; > >> > - spin_unlock_bh(&fib_multipath_lock); > >> > -#else > >> > nexthop_nh->nh_flags &= ~nh_flags; > >> > -#endif > >> > } endfor_nexthops(fi) > >> > > >> > if (alive > 0) { > >> > fi->fib_flags &= ~nh_flags; > >> > ret++; > >> > } > >> > + > >> > + fib_rebalance(fi); > >> > } > >> > > >> > return ret; > >> > 
@@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev, > >> > unsigned int nh_flags) > >> > > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > > >> > -/* > >> > - * The algorithm is suboptimal, but it provides really > >> > - * fair weighted route distribution. > >> > - */ > >> > -void fib_select_multipath(struct fib_result *res) > >> > +void fib_select_multipath(struct fib_result *res, int hash) > >> > { > >> > struct fib_info *fi = res->fi; > >> > - struct in_device *in_dev; > >> > - int w; > >> > - > >> > - spin_lock_bh(&fib_multipath_lock); > >> > - if (fi->fib_power <= 0) { > >> > - int power = 0; > >> > - change_nexthops(fi) { > >> > - in_dev = > >> > __in_dev_get_rcu(nexthop_nh->nh_dev); > >> > - if (nexthop_nh->nh_flags & RTNH_F_DEAD) > >> > - continue; > >> > - if (in_dev && > >> > - > >> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) && > >> > - nexthop_nh->nh_flags & > >> > RTNH_F_LINKDOWN) > >> > - continue; > >> > - power += nexthop_nh->nh_weight; > >> > - nexthop_nh->nh_power = > >> > nexthop_nh->nh_weight; > >> > - } endfor_nexthops(fi); > >> > - fi->fib_power = power; > >> > - if (power <= 0) { > >> > - spin_unlock_bh(&fib_multipath_lock); > >> > - /* Race condition: route has just become > >> > dead. */ > >> > - res->nh_sel = 0; > >> > - return; > >> > - } > >> > - } > >> > - > >> > > >> > - /* w should be random number [0..fi->fib_power-1], > >> > - * it is pretty bad approximation. > >> > - */ > >> > - > >> > - w = jiffies % fi->fib_power; > >> > + for_nexthops(fi) { > >> > + if (hash > atomic_read(&nh->nh_upper_bound)) > >> > + continue; > >> > > >> > - change_nexthops(fi) { > >> > - if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) && > >> > - nexthop_nh->nh_power) { > >> > - w -= nexthop_nh->nh_power; > >> > - if (w <= 0) { > >> > - nexthop_nh->nh_power--; > >> > - fi->fib_power--; > >> > - res->nh_sel = nhsel; > >> > - > >> > spin_unlock_bh(&fib_multipath_lock); > >> > - return; > >> > - } > >> > - } > >> > + res->nh_sel = nhsel; > >> > + return; > >> > } endfor_nexthops(fi); > >> > > >> > /* Race condition: route has just become dead. 
*/ > >> > res->nh_sel = 0; > >> > - spin_unlock_bh(&fib_multipath_lock); > >> > } > >> > #endif > >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > >> > index 80f7c5b..3dc83b6 100644 > >> > --- a/net/ipv4/route.c > >> > +++ b/net/ipv4/route.c > >> > @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff > >> > *skb, __be32 daddr, __be32 saddr, u32 tos) > >> > { > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > - if (res->fi && res->fi->fib_nhs > 1) > >> > - fib_select_multipath(res); > >> > + if (res->fi && res->fi->fib_nhs > 1) { > >> > + int h; > >> > + > >> > + h = fib_multipath_hash(saddr, daddr); > >> > + fib_select_multipath(res, h); > >> > + } > >> > #endif > >> > > >> > /* create a routing cache entry */ > >> > @@ -2190,8 +2194,12 @@ struct rtable > >> > *__ip_route_output_key(struct net *net, struct flowi4 *fl4) } > >> > > >> > #ifdef CONFIG_IP_ROUTE_MULTIPATH > >> > - if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) > >> > - fib_select_multipath(&res); > >> > + if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { > >> > + int h; > >> > + > >> > + h = fib_multipath_hash(fl4->saddr, fl4->daddr); > >> > + fib_select_multipath(&res, h); > >> > + } > >> > else > >> > #endif > >> > if (!res.prefixlen && > >> > -- > >> > 1.7.10.4 > >> > > >> > -- > >> > To unsubscribe from this list: send the line "unsubscribe > >> > netdev" in the body of a message to majordomo@vger.kernel.org > >> > More majordomo info at > >> > http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 21:00 ` Tom Herbert 2015-09-23 21:43 ` Peter Nørlund @ 2015-09-23 23:09 ` Peter Nørlund 2015-09-23 23:37 ` Tom Herbert 1 sibling, 1 reply; 16+ messages in thread From: Peter Nørlund @ 2015-09-23 23:09 UTC (permalink / raw) To: Tom Herbert Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, 23 Sep 2015 14:00:43 -0700 Tom Herbert <tom@herbertland.com> wrote: > On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> > wrote: > > Replaces the per-packet multipath with a hash-based multipath using > > source and destination address. > > > It's good that round robin is going away, but this still looks very > different with how multipath routing is done done in IPv6 > (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4 > hashes addresses, but IPv6 includes ports. How can we rectify this? > I may be wrong, since I haven't delved that much into the IPv6 code, but rt6_multipath_select is nice and clean because it doesn't have to deal with different weights of the paths. As for not including the ports, it is for the sole purpose of not disrupting the flow when fragmented packets are received. This is more likely with IPv4 than with IPv6, since PMTUD is optional with IPv4. In an ideal world, the IPv6 code shouldn't look at anything but addresses and flow label either, based on the principle that the router shouldn't care about L4 and above (but then it shouldn't look at ICMP either, heh) - but I know this isn't an ideal world and I have no operational experience with IPv6, so I can't tell whether clients populate the flow label properly. I would argue that L3-based hashing is more than sufficient for most websites and ISPs, where the number of addresses is high. At least on the network I have access to, L4 gave very little extra (3%). But I knew linux users would be demanding L4 hashing despite my beliefs, and there would probably even be people missing the per-packet multipath. This is why I started out reintroducing the RTA_MP_ALGO attribute in my original patch. To be honest, L4 might almost work in my network which hosts a few relatively large Danish websites. Fragmentation is only a problem on clients not doing PMTU (~10%) that have large HTTP cookies (very few). But for these people, they'll have a 50% chance of not being able to access our sites at all, because packets are distributed to load balancers which have not been updated with the connection state yet. My goal is to create the right solution, and to me the right solution is a solution which doesn't break anything whatsoever. It doesn't cause out-of-order packets or lost packets just to utilize some links better. ECMP, Link Aggregation, anycast and load balancers are all hacks, if you ask me - and these hacks must be careful to not destroy the illusion that an IP address maps to a single host and the path to that host is through one cable. If you all disagree, I'll change it - no problem. Just about anything is better than the per-packet solution. But I'll have to consider whether we will be running a modified version of the multipath code in my network. Best regards, Peter Nørlund ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 23:09 ` Peter Nørlund @ 2015-09-23 23:37 ` Tom Herbert 0 siblings, 0 replies; 16+ messages in thread From: Tom Herbert @ 2015-09-23 23:37 UTC (permalink / raw) To: Peter Nørlund Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, Sep 23, 2015 at 4:09 PM, Peter Nørlund <pch@ordbogen.com> wrote: > On Wed, 23 Sep 2015 14:00:43 -0700 > Tom Herbert <tom@herbertland.com> wrote: > >> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> >> wrote: >> > Replaces the per-packet multipath with a hash-based multipath using >> > source and destination address. >> > >> It's good that round robin is going away, but this still looks very >> different with how multipath routing is done done in IPv6 >> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4 >> hashes addresses, but IPv6 includes ports. How can we rectify this? >> > > I may be wrong, since I haven't delved that much into the IPv6 code, but > rt6_multipath_select is nice and clean because it doesn't have to burden > with different weights of the paths. > > As for not including the ports, it is for the sole purpose of not > disruption the flow when fragmented packets are received. This is more > likely with IPv4 than with IPv6, since PMTUD is optional with IPv4. In > an ideal world, the IPv6 code shouldn't look at anything but addresses > and flow label either, based on the principle that the router shouldn't > care about L4 and above (but then it shouldn't look at ICMP either, heh) > - but I know this isn't an ideal world and I have no operational > experience with IPv6, so I can't tell whether clients populate the flow > label properly. > > Ḯ would argue that L3-based hashing is more than sufficient for > most websites and ISPs, where the number of addresses is high. At least > on the network I have access to, L4 gave very little extra (3%). But I > knew linux users would be demanding L4 hashing despite my beliefs, and > there would probably even be people missing the per-packet multipath. > This is why I started out reintroducing the RTA_MP_ALGO attribute in > my original patch. > L4 versus L3 hashing is not my primary concern. It is the glaring inconsistency between IPv4 and IPv6. If we have fundamentally different behaviors between these versions of IP this can create (and has created) headaches for users running IPv4 and IPv6 networks (which is now basically the Internet). All of your points for why L3 is better that L4 hashing for IPv4 should apply to IPv6 given current state of flow label support. So not try for a unified solution? Either both should just use L3 hashing, or both should allow configurable use of L3 or L4 hashing. Thanks, Tom > To be honest, L4 might almost work in my network which hosts a > few relatively large Danish websites. Fragmentation is only a problem on > clients not doing PMTU (~10%) having large HTTP cookies (very few). But > to these people, they'll have a 50% chance of not being able to access > our sites at all, because packets are distributed to load balancers > which have not been updated with the connection state yet. > > My goal is to create the right solution, and to me the right solution > is a solution which doesn't break anything whatsoever. It doesn't cause > out-of-order packets or lost packets just to utilize some links > better. 
ECMP, Link Aggregation, anycast and load balancers are all > hacks, if you ask me - and these hacks must be careful to not destroy > the illusion that an IP address maps to a single host and the path to > that host is through one cable. > > If you all disagree, I'll change it - no problem. Just about anything > is better than the per-packet solution. But I'll have to consider > whether we will be running a modified version of the multipath code in > my network. > > Best regards, > Peter Nørlund ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath 2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund 2015-09-23 21:00 ` Tom Herbert @ 2015-09-23 21:55 ` David Miller 1 sibling, 0 replies; 16+ messages in thread From: David Miller @ 2015-09-23 21:55 UTC (permalink / raw) To: pch; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber From: Peter Nørlund <pch@ordbogen.com> Date: Wed, 23 Sep 2015 21:49:36 +0200 > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr) > +{ > + return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1; > +} Like Tom I'm curious why you do this shift here? ^ permalink raw reply [flat|nested] 16+ messages in thread
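A hedged note on the question above, based only on what is visible in this series: jhash_2words() returns a full 32-bit value, and dropping a bit keeps the result non-negative when it is carried around as an int. That leaves negative values free to act as a sentinel, which patch 2/2 relies on when the __ip_route_output_key() wrapper passes -1 to mean "no precomputed hash". A minimal sketch of that convention, with invented names:

#include <stdint.h>

/* Invented names sketching the sentinel convention: a 31-bit hash held in
 * an int is always >= 0, so -1 is free to mean "no hash computed yet".
 */
#define DEMO_NO_HASH	(-1)

static int demo_flow_hash(uint32_t saddr, uint32_t daddr, uint32_t seed)
{
	/* placeholder mix standing in for jhash_2words() */
	uint32_t h = (saddr * 0x9e3779b9u) ^ (daddr * 0x85ebca6bu) ^ seed;

	return (int)(h >> 1);	/* drop one bit: result is 0 .. 2^31 - 1 */
}

static int demo_resolve_hash(int mp_hash, uint32_t saddr, uint32_t daddr,
			     uint32_t seed)
{
	if (mp_hash == DEMO_NO_HASH)	/* caller supplied no ICMP-derived hash */
		mp_hash = demo_flow_hash(saddr, daddr, seed);
	return mp_hash;
}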
* [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath 2015-09-23 19:49 [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing Peter Nørlund 2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund @ 2015-09-23 19:49 ` Peter Nørlund 2015-09-23 21:01 ` Tom Herbert 2015-09-29 2:33 ` [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing David Miller 2 siblings, 1 reply; 16+ messages in thread From: Peter Nørlund @ 2015-09-23 19:49 UTC (permalink / raw) To: netdev Cc: David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Peter Nørlund ICMP packets are inspected to let them route together with the flow they belong to, minimizing the chance that a problematic path will affect flows on other paths, and so that anycast environments can work with ECMP. Signed-off-by: Peter Nørlund <pch@ordbogen.com> --- include/net/route.h | 11 +++++++++- net/ipv4/icmp.c | 19 ++++++++++++++++- net/ipv4/route.c | 57 +++++++++++++++++++++++++++++++++++++++++++++------ 3 files changed, 79 insertions(+), 8 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index 10a7d21..0f7abdc 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -28,6 +28,7 @@ #include <net/inetpeer.h> #include <net/flow.h> #include <net/inet_sock.h> +#include <net/ip_fib.h> #include <linux/in_route.h> #include <linux/rtnetlink.h> #include <linux/rcupdate.h> @@ -112,7 +113,15 @@ struct in_device; int ip_rt_init(void); void rt_cache_flush(struct net *net); void rt_flush_dev(struct net_device *dev); -struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp); +struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp, + int mp_hash); + +static inline struct rtable *__ip_route_output_key(struct net *net, + struct flowi4 *flp) +{ + return __ip_route_output_key_hash(net, flp, -1); +} + struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp, struct sock *sk); struct dst_entry *ipv4_blackhole_route(struct net *net, diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 79fe05b..f81daca 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -440,6 +440,22 @@ out_unlock: icmp_xmit_unlock(sk); } +#ifdef CONFIG_IP_ROUTE_MULTIPATH + +/* Source and destination is swapped. See ip_multipath_icmp_hash */ +static int icmp_multipath_hash_skb(const struct sk_buff *skb) +{ + const struct iphdr *iph = ip_hdr(skb); + + return fib_multipath_hash(iph->daddr, iph->saddr); +} + +#else + +#define icmp_multipath_hash_skb(skb) (-1) + +#endif + static struct rtable *icmp_route_lookup(struct net *net, struct flowi4 *fl4, struct sk_buff *skb_in, @@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct net *net, fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev) ? : skb_in->dev->ifindex; security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4)); - rt = __ip_route_output_key(net, fl4); + rt = __ip_route_output_key_hash(net, fl4, + icmp_multipath_hash_skb(skb_in)); if (IS_ERR(rt)) return rt; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 3dc83b6..29048e1 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1652,6 +1652,48 @@ out: return err; } +#ifdef CONFIG_IP_ROUTE_MULTIPATH + +/* To make ICMP packets follow the right flow, the multipath hash is + * calculated from the inner IP addresses in reverse order. 
+ */ +static int ip_multipath_icmp_hash(struct sk_buff *skb) +{ + const struct iphdr *outer_iph = ip_hdr(skb); + struct icmphdr _icmph; + const struct icmphdr *icmph; + struct iphdr _inner_iph; + const struct iphdr *inner_iph; + + if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0)) + goto standard_hash; + + icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph), + &_icmph); + if (!icmph) + goto standard_hash; + + if (icmph->type != ICMP_DEST_UNREACH && + icmph->type != ICMP_REDIRECT && + icmph->type != ICMP_TIME_EXCEEDED && + icmph->type != ICMP_PARAMETERPROB) { + goto standard_hash; + } + + inner_iph = skb_header_pointer(skb, + outer_iph->ihl * 4 + sizeof(_icmph), + sizeof(_inner_iph), &_inner_iph); + if (!inner_iph) + goto standard_hash; + + return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr); + +standard_hash: + return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr); +} + +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ + static int ip_mkroute_input(struct sk_buff *skb, struct fib_result *res, const struct flowi4 *fl4, @@ -1662,7 +1704,10 @@ static int ip_mkroute_input(struct sk_buff *skb, if (res->fi && res->fi->fib_nhs > 1) { int h; - h = fib_multipath_hash(saddr, daddr); + if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP)) + h = ip_multipath_icmp_hash(skb); + else + h = fib_multipath_hash(saddr, daddr); fib_select_multipath(res, h); } #endif @@ -2032,7 +2077,8 @@ add: * Major route resolver routine. */ -struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) +struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, + int mp_hash) { struct net_device *dev_out = NULL; __u8 tos = RT_FL_TOS(fl4); @@ -2195,10 +2241,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) #ifdef CONFIG_IP_ROUTE_MULTIPATH if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { - int h; - - h = fib_multipath_hash(fl4->saddr, fl4->daddr); - fib_select_multipath(&res, h); + if (mp_hash < 0) + mp_hash = fib_multipath_hash(fl4->saddr, fl4->daddr); + fib_select_multipath(&res, mp_hash); } else #endif -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
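To see why ip_multipath_icmp_hash() hashes the inner addresses in reverse order, take an anycast service V reached through the ECMP router and a client C. Ordinary C->V packets arriving at the router are hashed as hash(saddr=C, daddr=V). When a downstream router returns an ICMP error about a V->C packet, the error is addressed to V and embeds the original V->C header, so hashing (inner daddr, inner saddr) = (C, V) reproduces the forward flow's hash and the error lands on the same nexthop, and therefore the same anycast instance, as the data flow. A tiny self-check of that symmetry (invented names; any two-word hash behaves the same way):

#include <assert.h>
#include <stdint.h>

/* Stand-in for fib_multipath_hash(); only the argument order matters here. */
static uint32_t demo_hash(uint32_t a, uint32_t b)
{
	return a * 0x9e3779b9u + b;
}

int main(void)
{
	uint32_t client = 0x0a000001;	/* C, the remote client */
	uint32_t anycast = 0x0a000002;	/* V, the anycast service address */

	/* Hash used when ordinary C->V packets are forwarded. */
	uint32_t fwd = demo_hash(client, anycast);

	/* ICMP error about a V->C packet: the embedded header has
	 * saddr = V and daddr = C, so hashing (inner daddr, inner saddr)
	 * reconstructs the forward flow's (saddr, daddr) order.
	 */
	uint32_t inner_saddr = anycast, inner_daddr = client;
	uint32_t err = demo_hash(inner_daddr, inner_saddr);

	assert(fwd == err);	/* same nexthop, same anycast instance */
	return 0;
}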
* Re: [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath 2015-09-23 19:49 ` [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath Peter Nørlund @ 2015-09-23 21:01 ` Tom Herbert 2015-09-23 22:07 ` Peter Nørlund 0 siblings, 1 reply; 16+ messages in thread From: Tom Herbert @ 2015-09-23 21:01 UTC (permalink / raw) To: Peter Nørlund Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> wrote: > ICMP packets are inspected to let them route together with the flow they > belong to, minimizing the chance that a problematic path will affect flows > on other paths, and so that anycast environments can work with ECMP. > > Signed-off-by: Peter Nørlund <pch@ordbogen.com> > --- > include/net/route.h | 11 +++++++++- > net/ipv4/icmp.c | 19 ++++++++++++++++- > net/ipv4/route.c | 57 +++++++++++++++++++++++++++++++++++++++++++++------ > 3 files changed, 79 insertions(+), 8 deletions(-) > > diff --git a/include/net/route.h b/include/net/route.h > index 10a7d21..0f7abdc 100644 > --- a/include/net/route.h > +++ b/include/net/route.h > @@ -28,6 +28,7 @@ > #include <net/inetpeer.h> > #include <net/flow.h> > #include <net/inet_sock.h> > +#include <net/ip_fib.h> > #include <linux/in_route.h> > #include <linux/rtnetlink.h> > #include <linux/rcupdate.h> > @@ -112,7 +113,15 @@ struct in_device; > int ip_rt_init(void); > void rt_cache_flush(struct net *net); > void rt_flush_dev(struct net_device *dev); > -struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp); > +struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp, > + int mp_hash); > + > +static inline struct rtable *__ip_route_output_key(struct net *net, > + struct flowi4 *flp) > +{ > + return __ip_route_output_key_hash(net, flp, -1); > +} > + > struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp, > struct sock *sk); > struct dst_entry *ipv4_blackhole_route(struct net *net, > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c > index 79fe05b..f81daca 100644 > --- a/net/ipv4/icmp.c > +++ b/net/ipv4/icmp.c > @@ -440,6 +440,22 @@ out_unlock: > icmp_xmit_unlock(sk); > } > > +#ifdef CONFIG_IP_ROUTE_MULTIPATH > + > +/* Source and destination is swapped. See ip_multipath_icmp_hash */ > +static int icmp_multipath_hash_skb(const struct sk_buff *skb) > +{ > + const struct iphdr *iph = ip_hdr(skb); > + > + return fib_multipath_hash(iph->daddr, iph->saddr); > +} > + > +#else > + > +#define icmp_multipath_hash_skb(skb) (-1) > + > +#endif > + > static struct rtable *icmp_route_lookup(struct net *net, > struct flowi4 *fl4, > struct sk_buff *skb_in, > @@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct net *net, > fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev) ? : skb_in->dev->ifindex; > > security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4)); > - rt = __ip_route_output_key(net, fl4); > + rt = __ip_route_output_key_hash(net, fl4, > + icmp_multipath_hash_skb(skb_in)); > if (IS_ERR(rt)) > return rt; > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 3dc83b6..29048e1 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -1652,6 +1652,48 @@ out: > return err; > } > > +#ifdef CONFIG_IP_ROUTE_MULTIPATH > + > +/* To make ICMP packets follow the right flow, the multipath hash is > + * calculated from the inner IP addresses in reverse order. > + */ > +static int ip_multipath_icmp_hash(struct sk_buff *skb) > +{ This seems like overkill to me. 
Even if you do this on the host no device in the network is going to crack open ICMP like this for doing its ECMP. I suppose the argument is that this is another hack needed to make anycast "work"... > + const struct iphdr *outer_iph = ip_hdr(skb); > + struct icmphdr _icmph; > + const struct icmphdr *icmph; > + struct iphdr _inner_iph; > + const struct iphdr *inner_iph; > + > + if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0)) > + goto standard_hash; > + > + icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph), > + &_icmph); > + if (!icmph) > + goto standard_hash; > + > + if (icmph->type != ICMP_DEST_UNREACH && > + icmph->type != ICMP_REDIRECT && > + icmph->type != ICMP_TIME_EXCEEDED && > + icmph->type != ICMP_PARAMETERPROB) { > + goto standard_hash; > + } > + > + inner_iph = skb_header_pointer(skb, > + outer_iph->ihl * 4 + sizeof(_icmph), > + sizeof(_inner_iph), &_inner_iph); > + if (!inner_iph) > + goto standard_hash; > + > + return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr); > + > +standard_hash: > + return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr); > +} > + > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ > + > static int ip_mkroute_input(struct sk_buff *skb, > struct fib_result *res, > const struct flowi4 *fl4, > @@ -1662,7 +1704,10 @@ static int ip_mkroute_input(struct sk_buff *skb, > if (res->fi && res->fi->fib_nhs > 1) { > int h; > > - h = fib_multipath_hash(saddr, daddr); > + if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP)) > + h = ip_multipath_icmp_hash(skb); > + else > + h = fib_multipath_hash(saddr, daddr); > fib_select_multipath(res, h); > } > #endif > @@ -2032,7 +2077,8 @@ add: > * Major route resolver routine. > */ > > -struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) > +struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, > + int mp_hash) > { > struct net_device *dev_out = NULL; > __u8 tos = RT_FL_TOS(fl4); > @@ -2195,10 +2241,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { > - int h; > - > - h = fib_multipath_hash(fl4->saddr, fl4->daddr); > - fib_select_multipath(&res, h); > + if (mp_hash < 0) > + mp_hash = fib_multipath_hash(fl4->saddr, fl4->daddr); > + fib_select_multipath(&res, mp_hash); > } > else > #endif > -- > 1.7.10.4 > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath 2015-09-23 21:01 ` Tom Herbert @ 2015-09-23 22:07 ` Peter Nørlund 0 siblings, 0 replies; 16+ messages in thread From: Peter Nørlund @ 2015-09-23 22:07 UTC (permalink / raw) To: Tom Herbert Cc: Linux Kernel Network Developers, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy On Wed, 23 Sep 2015 14:01:23 -0700 Tom Herbert <tom@herbertland.com> wrote: > On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> > wrote: > > ICMP packets are inspected to let them route together with the flow > > they belong to, minimizing the chance that a problematic path will > > affect flows on other paths, and so that anycast environments can > > work with ECMP. > > > > Signed-off-by: Peter Nørlund <pch@ordbogen.com> > > --- > > include/net/route.h | 11 +++++++++- > > net/ipv4/icmp.c | 19 ++++++++++++++++- > > net/ipv4/route.c | 57 > > +++++++++++++++++++++++++++++++++++++++++++++------ 3 files > > changed, 79 insertions(+), 8 deletions(-) > > > > diff --git a/include/net/route.h b/include/net/route.h > > index 10a7d21..0f7abdc 100644 > > --- a/include/net/route.h > > +++ b/include/net/route.h > > @@ -28,6 +28,7 @@ > > #include <net/inetpeer.h> > > #include <net/flow.h> > > #include <net/inet_sock.h> > > +#include <net/ip_fib.h> > > #include <linux/in_route.h> > > #include <linux/rtnetlink.h> > > #include <linux/rcupdate.h> > > @@ -112,7 +113,15 @@ struct in_device; > > int ip_rt_init(void); > > void rt_cache_flush(struct net *net); > > void rt_flush_dev(struct net_device *dev); > > -struct rtable *__ip_route_output_key(struct net *, struct flowi4 > > *flp); +struct rtable *__ip_route_output_key_hash(struct net *, > > struct flowi4 *flp, > > + int mp_hash); > > + > > +static inline struct rtable *__ip_route_output_key(struct net *net, > > + struct flowi4 > > *flp) +{ > > + return __ip_route_output_key_hash(net, flp, -1); > > +} > > + > > struct rtable *ip_route_output_flow(struct net *, struct flowi4 > > *flp, struct sock *sk); > > struct dst_entry *ipv4_blackhole_route(struct net *net, > > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c > > index 79fe05b..f81daca 100644 > > --- a/net/ipv4/icmp.c > > +++ b/net/ipv4/icmp.c > > @@ -440,6 +440,22 @@ out_unlock: > > icmp_xmit_unlock(sk); > > } > > > > +#ifdef CONFIG_IP_ROUTE_MULTIPATH > > + > > +/* Source and destination is swapped. See ip_multipath_icmp_hash */ > > +static int icmp_multipath_hash_skb(const struct sk_buff *skb) > > +{ > > + const struct iphdr *iph = ip_hdr(skb); > > + > > + return fib_multipath_hash(iph->daddr, iph->saddr); > > +} > > + > > +#else > > + > > +#define icmp_multipath_hash_skb(skb) (-1) > > + > > +#endif > > + > > static struct rtable *icmp_route_lookup(struct net *net, > > struct flowi4 *fl4, > > struct sk_buff *skb_in, > > @@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct > > net *net, fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev) ? 
: > > skb_in->dev->ifindex; > > > > security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4)); > > - rt = __ip_route_output_key(net, fl4); > > + rt = __ip_route_output_key_hash(net, fl4, > > + > > icmp_multipath_hash_skb(skb_in)); if (IS_ERR(rt)) > > return rt; > > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > index 3dc83b6..29048e1 100644 > > --- a/net/ipv4/route.c > > +++ b/net/ipv4/route.c > > @@ -1652,6 +1652,48 @@ out: > > return err; > > } > > > > +#ifdef CONFIG_IP_ROUTE_MULTIPATH > > + > > +/* To make ICMP packets follow the right flow, the multipath hash > > is > > + * calculated from the inner IP addresses in reverse order. > > + */ > > +static int ip_multipath_icmp_hash(struct sk_buff *skb) > > +{ > > This seems like overkill to me. Even if you do this on the host no > device in the network is going to crack open ICMP like this for doing > its ECMP. I suppose the argument is that this is another hack needed > to make anycast "work"... True, this is only _really_ needed for anycast setups, although with IPv4 it is solvable by disabling PMTU on the anycast machines, or using Packetization Layer Path MTU Discovery. IPv6 is actually more impacted by it (There is an IETF draft in the making, draft-ietf-v6ops-pmtud-ecmp-problem-04). I fully understand if the final verdict on the ICMP-handling is a loud NO!, but my rationale for trying anyway is as follows: 1. The before-mentioned IETF draft actually recommends the behavior if possible. 2. Non-anycast setups may also benefit from it by truly isolating which flows are affected by which paths. If one path was experiencing packet loss or delays or whatever, it could affect the ICMP errors packets of the flows on the other paths. 3. I did consider making it a sysctl thing, but having it always on doesn't break anything for anybody, yet it solves something for a few. 4. The alternative solutions I've seen today involves running a user space daemon which receives the ICMP packets through NFLOG and then broadcasts the packets to all paths. Compared to that, this if far more efficient. 
> > > + const struct iphdr *outer_iph = ip_hdr(skb); > > + struct icmphdr _icmph; > > + const struct icmphdr *icmph; > > + struct iphdr _inner_iph; > > + const struct iphdr *inner_iph; > > + > > + if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0)) > > + goto standard_hash; > > + > > + icmph = skb_header_pointer(skb, outer_iph->ihl * 4, > > sizeof(_icmph), > > + &_icmph); > > + if (!icmph) > > + goto standard_hash; > > + > > + if (icmph->type != ICMP_DEST_UNREACH && > > + icmph->type != ICMP_REDIRECT && > > + icmph->type != ICMP_TIME_EXCEEDED && > > + icmph->type != ICMP_PARAMETERPROB) { > > + goto standard_hash; > > + } > > + > > + inner_iph = skb_header_pointer(skb, > > + outer_iph->ihl * 4 + > > sizeof(_icmph), > > + sizeof(_inner_iph), > > &_inner_iph); > > + if (!inner_iph) > > + goto standard_hash; > > + > > + return fib_multipath_hash(inner_iph->daddr, > > inner_iph->saddr); + > > +standard_hash: > > + return fib_multipath_hash(outer_iph->saddr, > > outer_iph->daddr); +} > > + > > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ > > + > > static int ip_mkroute_input(struct sk_buff *skb, > > struct fib_result *res, > > const struct flowi4 *fl4, > > @@ -1662,7 +1704,10 @@ static int ip_mkroute_input(struct sk_buff > > *skb, if (res->fi && res->fi->fib_nhs > 1) { > > int h; > > > > - h = fib_multipath_hash(saddr, daddr); > > + if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP)) > > + h = ip_multipath_icmp_hash(skb); > > + else > > + h = fib_multipath_hash(saddr, daddr); > > fib_select_multipath(res, h); > > } > > #endif > > @@ -2032,7 +2077,8 @@ add: > > * Major route resolver routine. > > */ > > > > -struct rtable *__ip_route_output_key(struct net *net, struct > > flowi4 *fl4) +struct rtable *__ip_route_output_key_hash(struct net > > *net, struct flowi4 *fl4, > > + int mp_hash) > > { > > struct net_device *dev_out = NULL; > > __u8 tos = RT_FL_TOS(fl4); > > @@ -2195,10 +2241,9 @@ struct rtable *__ip_route_output_key(struct > > net *net, struct flowi4 *fl4) > > > > #ifdef CONFIG_IP_ROUTE_MULTIPATH > > if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) { > > - int h; > > - > > - h = fib_multipath_hash(fl4->saddr, fl4->daddr); > > - fib_select_multipath(&res, h); > > + if (mp_hash < 0) > > + mp_hash = fib_multipath_hash(fl4->saddr, > > fl4->daddr); > > + fib_select_multipath(&res, mp_hash); > > } > > else > > #endif > > -- > > 1.7.10.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing 2015-09-23 19:49 [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing Peter Nørlund 2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund 2015-09-23 19:49 ` [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath Peter Nørlund @ 2015-09-29 2:33 ` David Miller 2015-09-29 2:55 ` David Miller 2 siblings, 1 reply; 16+ messages in thread From: David Miller @ 2015-09-29 2:33 UTC (permalink / raw) To: pch; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber From: Peter Nørlund <pch@ordbogen.com> Date: Wed, 23 Sep 2015 21:49:35 +0200 > When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed > from more or less being destination-based into being quasi-random per-packet > scheduling. This increases the risk of out-of-order packets and makes it > impossible to use multipath together with anycast services. > > This patch series replaces the old implementation with flow-based load > balancing based on a hash over the source and destination addresses. This isn't perfect but it's a significant step in the right direction. So I'm going to apply this to net-next now and we can make incremental improvements upon it. Thanks. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing 2015-09-29 2:33 ` [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing David Miller @ 2015-09-29 2:55 ` David Miller 2015-09-29 11:29 ` Peter Nørlund 0 siblings, 1 reply; 16+ messages in thread From: David Miller @ 2015-09-29 2:55 UTC (permalink / raw) To: pch; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber From: David Miller <davem@davemloft.net> Date: Mon, 28 Sep 2015 19:33:55 -0700 (PDT) > From: Peter Nørlund <pch@ordbogen.com> > Date: Wed, 23 Sep 2015 21:49:35 +0200 > >> When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed >> from more or less being destination-based into being quasi-random per-packet >> scheduling. This increases the risk of out-of-order packets and makes it >> impossible to use multipath together with anycast services. >> >> This patch series replaces the old implementation with flow-based load >> balancing based on a hash over the source and destination addresses. > > This isn't perfect but it's a significant step in the right direction. > So I'm going to apply this to net-next now and we can make incremental > improvements upon it. Actually, I had to revert, this doesn't build: [davem@localhost net-next]$ make -s -j8 Setup is 16876 bytes (padded to 16896 bytes). System is 10011 kB CRC 324f2811 Kernel: arch/x86/boot/bzImage is ready (#337) ERROR: "__ip_route_output_key_hash" [net/dccp/dccp_ipv4.ko] undefined! scripts/Makefile.modpost:90: recipe for target '__modpost' failed make[1]: *** [__modpost] Error 1 Makefile:1095: recipe for target 'modules' failed make: *** [modules] Error 2 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing 2015-09-29 2:55 ` David Miller @ 2015-09-29 11:29 ` Peter Nørlund 2015-09-29 12:38 ` David Laight 0 siblings, 1 reply; 16+ messages in thread From: Peter Nørlund @ 2015-09-29 11:29 UTC (permalink / raw) To: David Miller; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber On Mon, 28 Sep 2015 19:55:41 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: David Miller <davem@davemloft.net> > Date: Mon, 28 Sep 2015 19:33:55 -0700 (PDT) > > > From: Peter Nørlund <pch@ordbogen.com> > > Date: Wed, 23 Sep 2015 21:49:35 +0200 > > > >> When the routing cache was removed in 3.6, the IPv4 multipath > >> algorithm changed from more or less being destination-based into > >> being quasi-random per-packet scheduling. This increases the risk > >> of out-of-order packets and makes it impossible to use multipath > >> together with anycast services. > >> > >> This patch series replaces the old implementation with flow-based > >> load balancing based on a hash over the source and destination > >> addresses. > > > > This isn't perfect but it's a significant step in the right > > direction. So I'm going to apply this to net-next now and we can > > make incremental improvements upon it. > > Actually, I had to revert, this doesn't build: > > [davem@localhost net-next]$ make -s -j8 > Setup is 16876 bytes (padded to 16896 bytes). > System is 10011 kB > CRC 324f2811 > Kernel: arch/x86/boot/bzImage is ready (#337) > ERROR: "__ip_route_output_key_hash" [net/dccp/dccp_ipv4.ko] undefined! > scripts/Makefile.modpost:90: recipe for target '__modpost' failed > make[1]: *** [__modpost] Error 1 > Makefile:1095: recipe for target 'modules' failed > make: *** [modules] Error 2 Sorry! I forgot to update the EXPORT_SYMBOL_GPL line. In the meantime I've been doing some thinking (and measuring). Considering that the broader goal is to make IPv6 and IPv4 behave as identical as possible, it is probably not such a bad idea to just use the flow dissector + modulo in the IPv4 code too - the patch will be simpler than the current one. I fear the performance impact of the flow dissector though - some of my earlier measurements showed that it was 5-6 times slower than the simple one I used. But maybe it is better to streamline the IPv4/IPv6 multipath first and then improve upon it afterward (make it work, make it right, make it fast). As for using L4 hashing with anycast, CloudFlare apparently does L4 hashing - they could have disabled it, but they didn't. Besides, analysis of my own load balancers showed that only one in every 500,000,000 packets is fragmented. And even if I hit a fragmented packet, it is only a problem if the packet hits the wrong load balancer, and if that load balancer haven't been updated with the state from another load balancer (that is, one of the very first packets). It is still a possible scenario though - especially with large HTTP cookies or file uploads. But apparently it is a common problem that IP fragments gets dropped on the Internet, so I suspect that ECMP+Anycast sites are just part of the pool of problematic sites for people with fragments. I'm still unsettled as to whether the ICMP handling belongs to the kernel or not. The above breakage was in the ICMP-part of the patchset, so judging from that, I guess it wasn't out of the question. 
But in the "IPv4 and IPv6 should behave identically"-mindset, it probably belongs to a separate, future patchset, adding ICMP handling to both IPv4 and IPv6 - and it is actually more important for IPv6 than IPv4 since PMTUD cannot be disabled. Best Regards, Peter Nørlund ^ permalink raw reply [flat|nested] 16+ messages in thread
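For what it's worth, the build error David hit is consistent with the explanation at the start of this reply: __ip_route_output_key() became a static inline wrapper around __ip_route_output_key_hash(), so modules such as dccp_ipv4.ko now need the new symbol exported instead of the old one. A follow-up would presumably amount to a one-line change in net/ipv4/route.c along these lines (a guess at the fix, not the actual v5 hunk):

--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ ... @@
-EXPORT_SYMBOL_GPL(__ip_route_output_key);
+EXPORT_SYMBOL_GPL(__ip_route_output_key_hash);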
* RE: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing 2015-09-29 11:29 ` Peter Nørlund @ 2015-09-29 12:38 ` David Laight 0 siblings, 0 replies; 16+ messages in thread From: David Laight @ 2015-09-29 12:38 UTC (permalink / raw) To: 'Peter Nørlund', David Miller Cc: netdev@vger.kernel.org, kuznet@ms2.inr.ac.ru, jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net From: Peter Nørlund > Sent: 29 September 2015 12:29 ... > As for using L4 hashing with anycast, CloudFlare apparently does L4 > hashing - they could have disabled it, but they didn't. Besides, > analysis of my own load balancers showed that only one in every > 500,000,000 packets is fragmented. And even if I hit a fragmented > packet, it is only a problem if the packet hits the wrong load > balancer, and if that load balancer haven't been updated with the state > from another load balancer (that is, one of the very first packets). It > is still a possible scenario though - especially with large HTTP > cookies or file uploads. But apparently it is a common problem that IP > fragments gets dropped on the Internet, so I suspect that ECMP+Anycast > sites are just part of the pool of problematic sites for people with > fragments. Fragmentation is usually more of an issue with UDP than TCP. Some SIP messages can get fragmented... David ^ permalink raw reply [flat|nested] 16+ messages in thread