* [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing
@ 2015-09-23 19:49 Peter Nørlund
  2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Peter Nørlund @ 2015-09-23 19:49 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Peter Nørlund

When the routing cache was removed in 3.6, the IPv4 multipath algorithm
changed from being more or less destination-based to quasi-random per-packet
scheduling. This increases the risk of out-of-order packets and makes it
impossible to use multipath together with anycast services.

This patch series replaces the old implementation with flow-based load
balancing based on a hash over the source and destination addresses.

Distribution of the hash is done with thresholds as described in RFC 2992.
This reduces the disruption when a path is added or removed on routes with
more than two paths.
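
For readers unfamiliar with the scheme, here is a minimal userspace-style
sketch of hash-threshold selection (illustrative only; the names are made
up and do not match the kernel code in the patches below):

/* RFC 2992 hash-threshold selection, illustrative sketch.  Each next hop
 * owns a contiguous slice of the 31-bit hash space proportional to its
 * weight; a flow is routed via the first hop whose upper bound covers its
 * hash.  Dead-hop handling from the actual patch is omitted for brevity.
 */
struct mp_nh {
	int weight;
	int upper_bound;
};

static void mp_set_thresholds(struct mp_nh *nhs, int n)
{
	long long total = 0, w = 0;
	int i;

	for (i = 0; i < n; i++)
		total += nhs[i].weight;

	for (i = 0; i < n; i++) {
		w += nhs[i].weight;
		/* round(2^31 * w / total) - 1 */
		nhs[i].upper_bound =
			(int)((2147483648LL * w + total / 2) / total) - 1;
	}
}

static int mp_select(const struct mp_nh *nhs, int n, int hash)
{
	int i;

	for (i = 0; i < n; i++)
		if (hash <= nhs[i].upper_bound)
			return i;
	return 0;	/* unreachable while the last hop covers the top of the range */
}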

To further the chance of successful use in conjunction with anycast, ICMP
error packets are hashed over the inner IP addresses. This ensures that PMTU
discovery works together with anycast or load balancers such as IPVS.

Port numbers are not considered, since fragments could cause problems with
anycast and IPVS. Relying on the DF flag for TCP packets is also insufficient,
since ICMP inspection effectively extracts information from the opposite
flow, which might have a different DF flag setting. This is also why the
RSS hash is not used: RSS hashes are typically based on the NDIS RSS
specification, which mandates TCP support.

Measurements of the additional overhead of a two-path multipath route
(ip_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz):

Original per-packet: ~394 cycles/packet
L3 hash:              ~76 cycles/packet

Changes in v4:
- Functions take hash directly instead of func ptr
- Added inline hash function
- Added dummy macros to minimize ifdefs
- Use upper 31 bits of hash instead of lower

Changes in v3:
- Multipath algorithm is no longer configurable (always L3)
- Added random seed to hash
- Moved ICMP inspection to isolated function
- Ignore source quench packets (deprecated as per RFC 6633)

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecessary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown

Best Regards
 Peter Nørlund


Peter Nørlund (2):
      ipv4: L3 hash-based multipath
      ipv4: ICMP packet inspection for multipath

 include/net/ip_fib.h     |   14 ++++-
 include/net/route.h      |   11 +++-
 net/ipv4/fib_semantics.c |  140 ++++++++++++++++++++++--------------------
 net/ipv4/icmp.c          |   19 +++++-
 net/ipv4/route.c         |   63 +++++++++++++++++--
 5 files changed, 172 insertions(+), 75 deletions(-)


* [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 19:49 [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing Peter Nørlund
@ 2015-09-23 19:49 ` Peter Nørlund
  2015-09-23 21:00   ` Tom Herbert
  2015-09-23 21:55   ` David Miller
  2015-09-23 19:49 ` [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath Peter Nørlund
  2015-09-29  2:33 ` [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing David Miller
  2 siblings, 2 replies; 16+ messages in thread
From: Peter Nørlund @ 2015-09-23 19:49 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Peter Nørlund

Replaces the per-packet multipath with a hash-based multipath using
source and destination address.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
---
 include/net/ip_fib.h     |   14 ++++-
 net/ipv4/fib_semantics.c |  140 +++++++++++++++++++++++++---------------------
 net/ipv4/route.c         |   16 ++++--
 3 files changed, 98 insertions(+), 72 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index a37d043..c454203 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -79,7 +79,7 @@ struct fib_nh {
 	unsigned char		nh_scope;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int			nh_weight;
-	int			nh_power;
+	atomic_t		nh_upper_bound;
 #endif
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	__u32			nh_tclassid;
@@ -118,7 +118,7 @@ struct fib_info {
 #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
 	int			fib_nhs;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-	int			fib_power;
+	int			fib_weight;
 #endif
 	struct rcu_head		rcu;
 	struct fib_nh		fib_nh[0];
@@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
 int fib_sync_down_dev(struct net_device *dev, unsigned long event);
 int fib_sync_down_addr(struct net *net, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
-void fib_select_multipath(struct fib_result *res);
+
+extern u32 fib_multipath_secret __read_mostly;
+
+static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
+{
+	return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1;
+}
+
+void fib_select_multipath(struct fib_result *res, int hash);
 
 /* Exported by fib_trie.c */
 void fib_trie_init(void);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 064bd3c..0c49d2f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
 static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-
-static DEFINE_SPINLOCK(fib_multipath_lock);
+u32 fib_multipath_secret __read_mostly;
 
 #define for_nexthops(fi) {						\
 	int nhsel; const struct fib_nh *nh;				\
@@ -532,7 +531,67 @@ errout:
 	return ret;
 }
 
-#endif
+static void fib_rebalance(struct fib_info *fi)
+{
+	int total;
+	int w;
+	struct in_device *in_dev;
+
+	if (fi->fib_nhs < 2)
+		return;
+
+	total = 0;
+	for_nexthops(fi) {
+		if (nh->nh_flags & RTNH_F_DEAD)
+			continue;
+
+		in_dev = __in_dev_get_rcu(nh->nh_dev);
+
+		if (in_dev &&
+		    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+		    nh->nh_flags & RTNH_F_LINKDOWN)
+			continue;
+
+		total += nh->nh_weight;
+	} endfor_nexthops(fi);
+
+	w = 0;
+	change_nexthops(fi) {
+		int upper_bound;
+
+		in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
+
+		if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
+			upper_bound = -1;
+		} else if (in_dev &&
+			   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+			   nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
+			upper_bound = -1;
+		} else {
+			w += nexthop_nh->nh_weight;
+			upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w,
+							total) - 1;
+		}
+
+		atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
+	} endfor_nexthops(fi);
+
+	net_get_random_once(&fib_multipath_secret,
+			    sizeof(fib_multipath_secret));
+}
+
+static inline void fib_add_weight(struct fib_info *fi,
+				  const struct fib_nh *nh)
+{
+	fi->fib_weight += nh->nh_weight;
+}
+
+#else /* CONFIG_IP_ROUTE_MULTIPATH */
+
+#define fib_rebalance(fi) do { } while (0)
+#define fib_add_weight(fi, nh) do { } while (0)
+
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
 
 static int fib_encap_match(struct net *net, u16 encap_type,
 			   struct nlattr *encap,
@@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 
 	change_nexthops(fi) {
 		fib_info_update_nh_saddr(net, nexthop_nh);
+		fib_add_weight(fi, nexthop_nh);
 	} endfor_nexthops(fi)
 
+	fib_rebalance(fi);
+
 link_it:
 	ofi = fib_find_info(fi);
 	if (ofi) {
@@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 					nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
 					break;
 				}
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-				spin_lock_bh(&fib_multipath_lock);
-				fi->fib_power -= nexthop_nh->nh_power;
-				nexthop_nh->nh_power = 0;
-				spin_unlock_bh(&fib_multipath_lock);
-#endif
 				dead++;
 			}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 			}
 			ret++;
 		}
+
+		fib_rebalance(fi);
 	}
 
 	return ret;
@@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 			    !__in_dev_get_rtnl(dev))
 				continue;
 			alive++;
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-			spin_lock_bh(&fib_multipath_lock);
-			nexthop_nh->nh_power = 0;
-			nexthop_nh->nh_flags &= ~nh_flags;
-			spin_unlock_bh(&fib_multipath_lock);
-#else
 			nexthop_nh->nh_flags &= ~nh_flags;
-#endif
 		} endfor_nexthops(fi)
 
 		if (alive > 0) {
 			fi->fib_flags &= ~nh_flags;
 			ret++;
 		}
+
+		fib_rebalance(fi);
 	}
 
 	return ret;
@@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 
-/*
- * The algorithm is suboptimal, but it provides really
- * fair weighted route distribution.
- */
-void fib_select_multipath(struct fib_result *res)
+void fib_select_multipath(struct fib_result *res, int hash)
 {
 	struct fib_info *fi = res->fi;
-	struct in_device *in_dev;
-	int w;
-
-	spin_lock_bh(&fib_multipath_lock);
-	if (fi->fib_power <= 0) {
-		int power = 0;
-		change_nexthops(fi) {
-			in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
-			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
-				continue;
-			if (in_dev &&
-			    IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
-			    nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
-				continue;
-			power += nexthop_nh->nh_weight;
-			nexthop_nh->nh_power = nexthop_nh->nh_weight;
-		} endfor_nexthops(fi);
-		fi->fib_power = power;
-		if (power <= 0) {
-			spin_unlock_bh(&fib_multipath_lock);
-			/* Race condition: route has just become dead. */
-			res->nh_sel = 0;
-			return;
-		}
-	}
-
 
-	/* w should be random number [0..fi->fib_power-1],
-	 * it is pretty bad approximation.
-	 */
-
-	w = jiffies % fi->fib_power;
+	for_nexthops(fi) {
+		if (hash > atomic_read(&nh->nh_upper_bound))
+			continue;
 
-	change_nexthops(fi) {
-		if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) &&
-		    nexthop_nh->nh_power) {
-			w -= nexthop_nh->nh_power;
-			if (w <= 0) {
-				nexthop_nh->nh_power--;
-				fi->fib_power--;
-				res->nh_sel = nhsel;
-				spin_unlock_bh(&fib_multipath_lock);
-				return;
-			}
-		}
+		res->nh_sel = nhsel;
+		return;
 	} endfor_nexthops(fi);
 
 	/* Race condition: route has just become dead. */
 	res->nh_sel = 0;
-	spin_unlock_bh(&fib_multipath_lock);
 }
 #endif
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 80f7c5b..3dc83b6 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff *skb,
 			    __be32 daddr, __be32 saddr, u32 tos)
 {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-	if (res->fi && res->fi->fib_nhs > 1)
-		fib_select_multipath(res);
+	if (res->fi && res->fi->fib_nhs > 1) {
+		int h;
+
+		h = fib_multipath_hash(saddr, daddr);
+		fib_select_multipath(res, h);
+	}
 #endif
 
 	/* create a routing cache entry */
@@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 	}
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-	if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
-		fib_select_multipath(&res);
+	if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
+		int h;
+
+		h = fib_multipath_hash(fl4->saddr, fl4->daddr);
+		fib_select_multipath(&res, h);
+	}
 	else
 #endif
 	if (!res.prefixlen &&
-- 
1.7.10.4


* [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath
  2015-09-23 19:49 [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing Peter Nørlund
  2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund
@ 2015-09-23 19:49 ` Peter Nørlund
  2015-09-23 21:01   ` Tom Herbert
  2015-09-29  2:33 ` [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing David Miller
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Nørlund @ 2015-09-23 19:49 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Peter Nørlund

ICMP packets are inspected to let them route together with the flow they
belong to, minimizing the chance that a problematic path will affect flows
on other paths, and so that anycast environments can work with ECMP.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
---
 include/net/route.h |   11 +++++++++-
 net/ipv4/icmp.c     |   19 ++++++++++++++++-
 net/ipv4/route.c    |   57 +++++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 79 insertions(+), 8 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 10a7d21..0f7abdc 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -28,6 +28,7 @@
 #include <net/inetpeer.h>
 #include <net/flow.h>
 #include <net/inet_sock.h>
+#include <net/ip_fib.h>
 #include <linux/in_route.h>
 #include <linux/rtnetlink.h>
 #include <linux/rcupdate.h>
@@ -112,7 +113,15 @@ struct in_device;
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
-struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
+struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
+					  int mp_hash);
+
+static inline struct rtable *__ip_route_output_key(struct net *net,
+						   struct flowi4 *flp)
+{
+	return __ip_route_output_key_hash(net, flp, -1);
+}
+
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 79fe05b..f81daca 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -440,6 +440,22 @@ out_unlock:
 	icmp_xmit_unlock(sk);
 }
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+
+/* Source and destination is swapped. See ip_multipath_icmp_hash */
+static int icmp_multipath_hash_skb(const struct sk_buff *skb)
+{
+	const struct iphdr *iph = ip_hdr(skb);
+
+	return fib_multipath_hash(iph->daddr, iph->saddr);
+}
+
+#else
+
+#define icmp_multipath_hash_skb(skb) (-1)
+
+#endif
+
 static struct rtable *icmp_route_lookup(struct net *net,
 					struct flowi4 *fl4,
 					struct sk_buff *skb_in,
@@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev) ? : skb_in->dev->ifindex;
 
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
-	rt = __ip_route_output_key(net, fl4);
+	rt = __ip_route_output_key_hash(net, fl4,
+					icmp_multipath_hash_skb(skb_in));
 	if (IS_ERR(rt))
 		return rt;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 3dc83b6..29048e1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1652,6 +1652,48 @@ out:
 	return err;
 }
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+
+/* To make ICMP packets follow the right flow, the multipath hash is
+ * calculated from the inner IP addresses in reverse order.
+ */
+static int ip_multipath_icmp_hash(struct sk_buff *skb)
+{
+	const struct iphdr *outer_iph = ip_hdr(skb);
+	struct icmphdr _icmph;
+	const struct icmphdr *icmph;
+	struct iphdr _inner_iph;
+	const struct iphdr *inner_iph;
+
+	if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
+		goto standard_hash;
+
+	icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
+				   &_icmph);
+	if (!icmph)
+		goto standard_hash;
+
+	if (icmph->type != ICMP_DEST_UNREACH &&
+	    icmph->type != ICMP_REDIRECT &&
+	    icmph->type != ICMP_TIME_EXCEEDED &&
+	    icmph->type != ICMP_PARAMETERPROB) {
+		goto standard_hash;
+	}
+
+	inner_iph = skb_header_pointer(skb,
+				       outer_iph->ihl * 4 + sizeof(_icmph),
+				       sizeof(_inner_iph), &_inner_iph);
+	if (!inner_iph)
+		goto standard_hash;
+
+	return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
+
+standard_hash:
+	return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
+}
+
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
+
 static int ip_mkroute_input(struct sk_buff *skb,
 			    struct fib_result *res,
 			    const struct flowi4 *fl4,
@@ -1662,7 +1704,10 @@ static int ip_mkroute_input(struct sk_buff *skb,
 	if (res->fi && res->fi->fib_nhs > 1) {
 		int h;
 
-		h = fib_multipath_hash(saddr, daddr);
+		if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP))
+			h = ip_multipath_icmp_hash(skb);
+		else
+			h = fib_multipath_hash(saddr, daddr);
 		fib_select_multipath(res, h);
 	}
 #endif
@@ -2032,7 +2077,8 @@ add:
  * Major route resolver routine.
  */
 
-struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
+struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
+					  int mp_hash)
 {
 	struct net_device *dev_out = NULL;
 	__u8 tos = RT_FL_TOS(fl4);
@@ -2195,10 +2241,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
-		int h;
-
-		h = fib_multipath_hash(fl4->saddr, fl4->daddr);
-		fib_select_multipath(&res, h);
+		if (mp_hash < 0)
+			mp_hash = fib_multipath_hash(fl4->saddr, fl4->daddr);
+		fib_select_multipath(&res, mp_hash);
 	}
 	else
 #endif
-- 
1.7.10.4


* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund
@ 2015-09-23 21:00   ` Tom Herbert
  2015-09-23 21:43     ` Peter Nørlund
  2015-09-23 23:09     ` Peter Nørlund
  2015-09-23 21:55   ` David Miller
  1 sibling, 2 replies; 16+ messages in thread
From: Tom Herbert @ 2015-09-23 21:00 UTC (permalink / raw)
  To: Peter Nørlund
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> wrote:
> Replaces the per-packet multipath with a hash-based multipath using
> source and destination address.
>
It's good that round robin is going away, but this still looks very
different from how multipath routing is done in IPv6
(rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4
hashes addresses, but IPv6 includes ports. How can we rectify this?

> Signed-off-by: Peter Nørlund <pch@ordbogen.com>
> ---
>  include/net/ip_fib.h     |   14 ++++-
>  net/ipv4/fib_semantics.c |  140 +++++++++++++++++++++++++---------------------
>  net/ipv4/route.c         |   16 ++++--
>  3 files changed, 98 insertions(+), 72 deletions(-)
>
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index a37d043..c454203 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -79,7 +79,7 @@ struct fib_nh {
>         unsigned char           nh_scope;
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>         int                     nh_weight;
> -       int                     nh_power;
> +       atomic_t                nh_upper_bound;
>  #endif
>  #ifdef CONFIG_IP_ROUTE_CLASSID
>         __u32                   nh_tclassid;
> @@ -118,7 +118,7 @@ struct fib_info {
>  #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
>         int                     fib_nhs;
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -       int                     fib_power;
> +       int                     fib_weight;
>  #endif
>         struct rcu_head         rcu;
>         struct fib_nh           fib_nh[0];
> @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
>  int fib_sync_down_dev(struct net_device *dev, unsigned long event);
>  int fib_sync_down_addr(struct net *net, __be32 local);
>  int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
> -void fib_select_multipath(struct fib_result *res);
> +
> +extern u32 fib_multipath_secret __read_mostly;
> +
> +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
> +{
> +       return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1;

Why the >> 1?

> +}
> +
> +void fib_select_multipath(struct fib_result *res, int hash);
>
>  /* Exported by fib_trie.c */
>  void fib_trie_init(void);
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 064bd3c..0c49d2f 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
>  static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
>
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -
> -static DEFINE_SPINLOCK(fib_multipath_lock);
> +u32 fib_multipath_secret __read_mostly;
>
>  #define for_nexthops(fi) {                                             \
>         int nhsel; const struct fib_nh *nh;                             \
> @@ -532,7 +531,67 @@ errout:
>         return ret;
>  }
>
> -#endif
> +static void fib_rebalance(struct fib_info *fi)
> +{
> +       int total;
> +       int w;
> +       struct in_device *in_dev;
> +
> +       if (fi->fib_nhs < 2)
> +               return;
> +
> +       total = 0;
> +       for_nexthops(fi) {
> +               if (nh->nh_flags & RTNH_F_DEAD)
> +                       continue;
> +
> +               in_dev = __in_dev_get_rcu(nh->nh_dev);
> +
> +               if (in_dev &&
> +                   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> +                   nh->nh_flags & RTNH_F_LINKDOWN)
> +                       continue;
> +
> +               total += nh->nh_weight;
> +       } endfor_nexthops(fi);
> +
> +       w = 0;
> +       change_nexthops(fi) {
> +               int upper_bound;
> +
> +               in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> +
> +               if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
> +                       upper_bound = -1;
> +               } else if (in_dev &&
> +                          IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> +                          nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
> +                       upper_bound = -1;
> +               } else {
> +                       w += nexthop_nh->nh_weight;
> +                       upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w,
> +                                                       total) - 1;
> +               }
> +

Is this complexity (aside from recomputing the key) needed because you
want consistent hashing for the anycast case?

> +               atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
> +       } endfor_nexthops(fi);
> +
> +       net_get_random_once(&fib_multipath_secret,
> +                           sizeof(fib_multipath_secret));
> +}
> +
> +static inline void fib_add_weight(struct fib_info *fi,
> +                                 const struct fib_nh *nh)
> +{
> +       fi->fib_weight += nh->nh_weight;
> +}
> +
> +#else /* CONFIG_IP_ROUTE_MULTIPATH */
> +
> +#define fib_rebalance(fi) do { } while (0)
> +#define fib_add_weight(fi, nh) do { } while (0)
> +
> +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
>
>  static int fib_encap_match(struct net *net, u16 encap_type,
>                            struct nlattr *encap,
> @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>
>         change_nexthops(fi) {
>                 fib_info_update_nh_saddr(net, nexthop_nh);
> +               fib_add_weight(fi, nexthop_nh);
>         } endfor_nexthops(fi)
>
> +       fib_rebalance(fi);
> +
>  link_it:
>         ofi = fib_find_info(fi);
>         if (ofi) {
> @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
>                                         nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
>                                         break;
>                                 }
> -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> -                               spin_lock_bh(&fib_multipath_lock);
> -                               fi->fib_power -= nexthop_nh->nh_power;
> -                               nexthop_nh->nh_power = 0;
> -                               spin_unlock_bh(&fib_multipath_lock);
> -#endif
>                                 dead++;
>                         }
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
>                         }
>                         ret++;
>                 }
> +
> +               fib_rebalance(fi);
>         }
>
>         return ret;
> @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>                             !__in_dev_get_rtnl(dev))
>                                 continue;
>                         alive++;
> -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> -                       spin_lock_bh(&fib_multipath_lock);
> -                       nexthop_nh->nh_power = 0;
> -                       nexthop_nh->nh_flags &= ~nh_flags;
> -                       spin_unlock_bh(&fib_multipath_lock);
> -#else
>                         nexthop_nh->nh_flags &= ~nh_flags;
> -#endif
>                 } endfor_nexthops(fi)
>
>                 if (alive > 0) {
>                         fi->fib_flags &= ~nh_flags;
>                         ret++;
>                 }
> +
> +               fib_rebalance(fi);
>         }
>
>         return ret;
> @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>
> -/*
> - * The algorithm is suboptimal, but it provides really
> - * fair weighted route distribution.
> - */
> -void fib_select_multipath(struct fib_result *res)
> +void fib_select_multipath(struct fib_result *res, int hash)
>  {
>         struct fib_info *fi = res->fi;
> -       struct in_device *in_dev;
> -       int w;
> -
> -       spin_lock_bh(&fib_multipath_lock);
> -       if (fi->fib_power <= 0) {
> -               int power = 0;
> -               change_nexthops(fi) {
> -                       in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> -                       if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> -                               continue;
> -                       if (in_dev &&
> -                           IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> -                           nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
> -                               continue;
> -                       power += nexthop_nh->nh_weight;
> -                       nexthop_nh->nh_power = nexthop_nh->nh_weight;
> -               } endfor_nexthops(fi);
> -               fi->fib_power = power;
> -               if (power <= 0) {
> -                       spin_unlock_bh(&fib_multipath_lock);
> -                       /* Race condition: route has just become dead. */
> -                       res->nh_sel = 0;
> -                       return;
> -               }
> -       }
> -
>
> -       /* w should be random number [0..fi->fib_power-1],
> -        * it is pretty bad approximation.
> -        */
> -
> -       w = jiffies % fi->fib_power;
> +       for_nexthops(fi) {
> +               if (hash > atomic_read(&nh->nh_upper_bound))
> +                       continue;
>
> -       change_nexthops(fi) {
> -               if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) &&
> -                   nexthop_nh->nh_power) {
> -                       w -= nexthop_nh->nh_power;
> -                       if (w <= 0) {
> -                               nexthop_nh->nh_power--;
> -                               fi->fib_power--;
> -                               res->nh_sel = nhsel;
> -                               spin_unlock_bh(&fib_multipath_lock);
> -                               return;
> -                       }
> -               }
> +               res->nh_sel = nhsel;
> +               return;
>         } endfor_nexthops(fi);
>
>         /* Race condition: route has just become dead. */
>         res->nh_sel = 0;
> -       spin_unlock_bh(&fib_multipath_lock);
>  }
>  #endif
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 80f7c5b..3dc83b6 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff *skb,
>                             __be32 daddr, __be32 saddr, u32 tos)
>  {
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -       if (res->fi && res->fi->fib_nhs > 1)
> -               fib_select_multipath(res);
> +       if (res->fi && res->fi->fib_nhs > 1) {
> +               int h;
> +
> +               h = fib_multipath_hash(saddr, daddr);
> +               fib_select_multipath(res, h);
> +       }
>  #endif
>
>         /* create a routing cache entry */
> @@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
>         }
>
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
> -               fib_select_multipath(&res);
> +       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
> +               int h;
> +
> +               h = fib_multipath_hash(fl4->saddr, fl4->daddr);
> +               fib_select_multipath(&res, h);
> +       }
>         else
>  #endif
>         if (!res.prefixlen &&
> --
> 1.7.10.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath
  2015-09-23 19:49 ` [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath Peter Nørlund
@ 2015-09-23 21:01   ` Tom Herbert
  2015-09-23 22:07     ` Peter Nørlund
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Herbert @ 2015-09-23 21:01 UTC (permalink / raw)
  To: Peter Nørlund
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com> wrote:
> ICMP packets are inspected to let them route together with the flow they
> belong to, minimizing the chance that a problematic path will affect flows
> on other paths, and so that anycast environments can work with ECMP.
>
> Signed-off-by: Peter Nørlund <pch@ordbogen.com>
> ---
>  include/net/route.h |   11 +++++++++-
>  net/ipv4/icmp.c     |   19 ++++++++++++++++-
>  net/ipv4/route.c    |   57 +++++++++++++++++++++++++++++++++++++++++++++------
>  3 files changed, 79 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/route.h b/include/net/route.h
> index 10a7d21..0f7abdc 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -28,6 +28,7 @@
>  #include <net/inetpeer.h>
>  #include <net/flow.h>
>  #include <net/inet_sock.h>
> +#include <net/ip_fib.h>
>  #include <linux/in_route.h>
>  #include <linux/rtnetlink.h>
>  #include <linux/rcupdate.h>
> @@ -112,7 +113,15 @@ struct in_device;
>  int ip_rt_init(void);
>  void rt_cache_flush(struct net *net);
>  void rt_flush_dev(struct net_device *dev);
> -struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
> +struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
> +                                         int mp_hash);
> +
> +static inline struct rtable *__ip_route_output_key(struct net *net,
> +                                                  struct flowi4 *flp)
> +{
> +       return __ip_route_output_key_hash(net, flp, -1);
> +}
> +
>  struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
>                                     struct sock *sk);
>  struct dst_entry *ipv4_blackhole_route(struct net *net,
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 79fe05b..f81daca 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -440,6 +440,22 @@ out_unlock:
>         icmp_xmit_unlock(sk);
>  }
>
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +
> +/* Source and destination is swapped. See ip_multipath_icmp_hash */
> +static int icmp_multipath_hash_skb(const struct sk_buff *skb)
> +{
> +       const struct iphdr *iph = ip_hdr(skb);
> +
> +       return fib_multipath_hash(iph->daddr, iph->saddr);
> +}
> +
> +#else
> +
> +#define icmp_multipath_hash_skb(skb) (-1)
> +
> +#endif
> +
>  static struct rtable *icmp_route_lookup(struct net *net,
>                                         struct flowi4 *fl4,
>                                         struct sk_buff *skb_in,
> @@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
>         fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev) ? : skb_in->dev->ifindex;
>
>         security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
> -       rt = __ip_route_output_key(net, fl4);
> +       rt = __ip_route_output_key_hash(net, fl4,
> +                                       icmp_multipath_hash_skb(skb_in));
>         if (IS_ERR(rt))
>                 return rt;
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 3dc83b6..29048e1 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1652,6 +1652,48 @@ out:
>         return err;
>  }
>
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +
> +/* To make ICMP packets follow the right flow, the multipath hash is
> + * calculated from the inner IP addresses in reverse order.
> + */
> +static int ip_multipath_icmp_hash(struct sk_buff *skb)
> +{

This seems like overkill to me. Even if you do this on the host, no
device in the network is going to crack open ICMP like this for doing
its ECMP. I suppose the argument is that this is another hack needed
to make anycast "work"...

> +       const struct iphdr *outer_iph = ip_hdr(skb);
> +       struct icmphdr _icmph;
> +       const struct icmphdr *icmph;
> +       struct iphdr _inner_iph;
> +       const struct iphdr *inner_iph;
> +
> +       if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
> +               goto standard_hash;
> +
> +       icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
> +                                  &_icmph);
> +       if (!icmph)
> +               goto standard_hash;
> +
> +       if (icmph->type != ICMP_DEST_UNREACH &&
> +           icmph->type != ICMP_REDIRECT &&
> +           icmph->type != ICMP_TIME_EXCEEDED &&
> +           icmph->type != ICMP_PARAMETERPROB) {
> +               goto standard_hash;
> +       }
> +
> +       inner_iph = skb_header_pointer(skb,
> +                                      outer_iph->ihl * 4 + sizeof(_icmph),
> +                                      sizeof(_inner_iph), &_inner_iph);
> +       if (!inner_iph)
> +               goto standard_hash;
> +
> +       return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
> +
> +standard_hash:
> +       return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
> +}
> +
> +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
> +
>  static int ip_mkroute_input(struct sk_buff *skb,
>                             struct fib_result *res,
>                             const struct flowi4 *fl4,
> @@ -1662,7 +1704,10 @@ static int ip_mkroute_input(struct sk_buff *skb,
>         if (res->fi && res->fi->fib_nhs > 1) {
>                 int h;
>
> -               h = fib_multipath_hash(saddr, daddr);
> +               if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP))
> +                       h = ip_multipath_icmp_hash(skb);
> +               else
> +                       h = fib_multipath_hash(saddr, daddr);
>                 fib_select_multipath(res, h);
>         }
>  #endif
> @@ -2032,7 +2077,8 @@ add:
>   * Major route resolver routine.
>   */
>
> -struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
> +struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
> +                                         int mp_hash)
>  {
>         struct net_device *dev_out = NULL;
>         __u8 tos = RT_FL_TOS(fl4);
> @@ -2195,10 +2241,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
>
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>         if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
> -               int h;
> -
> -               h = fib_multipath_hash(fl4->saddr, fl4->daddr);
> -               fib_select_multipath(&res, h);
> +               if (mp_hash < 0)
> +                       mp_hash = fib_multipath_hash(fl4->saddr, fl4->daddr);
> +               fib_select_multipath(&res, mp_hash);
>         }
>         else
>  #endif
> --
> 1.7.10.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 21:00   ` Tom Herbert
@ 2015-09-23 21:43     ` Peter Nørlund
  2015-09-23 23:12       ` Tom Herbert
  2015-09-23 23:09     ` Peter Nørlund
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Nørlund @ 2015-09-23 21:43 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, 23 Sep 2015 14:00:43 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com>
> wrote:
> > Replaces the per-packet multipath with a hash-based multipath using
> > source and destination address.
> >
> It's good that round robin is going away, but this still looks very
> different from how multipath routing is done in IPv6
> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4
> hashes addresses, but IPv6 includes ports. How can we rectify this?
> 
> > Signed-off-by: Peter Nørlund <pch@ordbogen.com>
> > ---
> >  include/net/ip_fib.h     |   14 ++++-
> >  net/ipv4/fib_semantics.c |  140
> > +++++++++++++++++++++++++---------------------
> > net/ipv4/route.c         |   16 ++++-- 3 files changed, 98
> > insertions(+), 72 deletions(-)
> >
> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> > index a37d043..c454203 100644
> > --- a/include/net/ip_fib.h
> > +++ b/include/net/ip_fib.h
> > @@ -79,7 +79,7 @@ struct fib_nh {
> >         unsigned char           nh_scope;
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >         int                     nh_weight;
> > -       int                     nh_power;
> > +       atomic_t                nh_upper_bound;
> >  #endif
> >  #ifdef CONFIG_IP_ROUTE_CLASSID
> >         __u32                   nh_tclassid;
> > @@ -118,7 +118,7 @@ struct fib_info {
> >  #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
> >         int                     fib_nhs;
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -       int                     fib_power;
> > +       int                     fib_weight;
> >  #endif
> >         struct rcu_head         rcu;
> >         struct fib_nh           fib_nh[0];
> > @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct
> > net_device *dev); int fib_sync_down_dev(struct net_device *dev,
> > unsigned long event); int fib_sync_down_addr(struct net *net,
> > __be32 local); int fib_sync_up(struct net_device *dev, unsigned int
> > nh_flags); -void fib_select_multipath(struct fib_result *res);
> > +
> > +extern u32 fib_multipath_secret __read_mostly;
> > +
> > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
> > +{
> > +       return jhash_2words(saddr, daddr, fib_multipath_secret) >>
> > 1;
> 
> Why the >> 1?

Each nexthop is assigned a range within the hash key space, with the
upper bound stored in nh_upper_bound, but I needed a way to mark a
nexthop as dead within that same atomic variable. I chose -1, reducing
the actual hash to 31 bits.  I shift the hash because I expect it to be
slightly faster than AND'ing with 0x7FFFFFFF on most platforms.

> 
> > +}
> > +
> > +void fib_select_multipath(struct fib_result *res, int hash);
> >
> >  /* Exported by fib_trie.c */
> >  void fib_trie_init(void);
> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> > index 064bd3c..0c49d2f 100644
> > --- a/net/ipv4/fib_semantics.c
> > +++ b/net/ipv4/fib_semantics.c
> > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
> >  static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
> >
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -
> > -static DEFINE_SPINLOCK(fib_multipath_lock);
> > +u32 fib_multipath_secret __read_mostly;
> >
> >  #define for_nexthops(fi)
> > {                                             \ int nhsel; const
> > struct fib_nh *nh;                             \ @@ -532,7 +531,67
> > @@ errout: return ret;
> >  }
> >
> > -#endif
> > +static void fib_rebalance(struct fib_info *fi)
> > +{
> > +       int total;
> > +       int w;
> > +       struct in_device *in_dev;
> > +
> > +       if (fi->fib_nhs < 2)
> > +               return;
> > +
> > +       total = 0;
> > +       for_nexthops(fi) {
> > +               if (nh->nh_flags & RTNH_F_DEAD)
> > +                       continue;
> > +
> > +               in_dev = __in_dev_get_rcu(nh->nh_dev);
> > +
> > +               if (in_dev &&
> > +                   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> > +                   nh->nh_flags & RTNH_F_LINKDOWN)
> > +                       continue;
> > +
> > +               total += nh->nh_weight;
> > +       } endfor_nexthops(fi);
> > +
> > +       w = 0;
> > +       change_nexthops(fi) {
> > +               int upper_bound;
> > +
> > +               in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> > +
> > +               if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
> > +                       upper_bound = -1;
> > +               } else if (in_dev &&
> > +
> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> > +                          nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
> > +                       upper_bound = -1;
> > +               } else {
> > +                       w += nexthop_nh->nh_weight;
> > +                       upper_bound =
> > DIV_ROUND_CLOSEST(2147483648LL * w,
> > +                                                       total) - 1;
> > +               }
> > +
> 
> Is this complexity (aside from recomputing the key) needed because you
> want consistent hashing for the anycast case?

It is an attempt to reduce the disruption when a path is added or
removed, in the way recommended by RFC 2992. It is not specific to
anycast at all, although disruptions are more severe for anycast than
for ordinary TCP, which might just slow down due to out-of-order
packets.

Traditionally, people used modulo to calculate the path from the hash,
but with the modulo approach 1/2 of all flows would change paths when a
path is added or removed. With the hash-threshold approach, somewhere
between 1/4 and 1/2 of all flows are affected, so it is an improvement
for routes with more than two paths.

Ideally you would have a hash-to-port mapping table or similar,
ensuring that only 1/N of all flows were affected (with N being the
number of paths).
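
To make the difference concrete, here is a small standalone simulation
sketch (illustrative only, not part of the patch; it assumes four
equal-weight paths, that the last one is removed, and uses a reduced
hash space for speed):

/* Compare modulo-N with hash-threshold when one of four equal-weight
 * paths is removed.  Only flows that were NOT on the removed path are
 * counted; the question is how many of those get remapped anyway.
 */
#include <stdio.h>
#include <stdint.h>

#define SPACE	(1u << 20)	/* stand-in for the 31-bit hash space */

static int by_modulo(uint32_t h, int n)    { return h % n; }
static int by_threshold(uint32_t h, int n) { return (int)((uint64_t)h * n / SPACE); }

int main(void)
{
	unsigned long mod_keep = 0, mod_total = 0;
	unsigned long thr_keep = 0, thr_total = 0;
	uint32_t h;

	for (h = 0; h < SPACE; h++) {
		if (by_modulo(h, 4) != 3) {	/* flow survives the removal */
			mod_total++;
			if (by_modulo(h, 4) == by_modulo(h, 3))
				mod_keep++;
		}
		if (by_threshold(h, 4) != 3) {
			thr_total++;
			if (by_threshold(h, 4) == by_threshold(h, 3))
				thr_keep++;
		}
	}

	printf("modulo:    %.1f%% of surviving flows keep their path\n",
	       100.0 * mod_keep / mod_total);
	printf("threshold: %.1f%% of surviving flows keep their path\n",
	       100.0 * thr_keep / thr_total);
	return 0;
}

With these particular assumptions it reports roughly a third of the
surviving flows keeping their path under modulo versus roughly two
thirds under thresholds; the exact numbers of course depend on the
number of paths and on which path goes away.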

> 
> > +               atomic_set(&nexthop_nh->nh_upper_bound,
> > upper_bound);
> > +       } endfor_nexthops(fi);
> > +
> > +       net_get_random_once(&fib_multipath_secret,
> > +                           sizeof(fib_multipath_secret));
> > +}
> > +
> > +static inline void fib_add_weight(struct fib_info *fi,
> > +                                 const struct fib_nh *nh)
> > +{
> > +       fi->fib_weight += nh->nh_weight;
> > +}
> > +
> > +#else /* CONFIG_IP_ROUTE_MULTIPATH */
> > +
> > +#define fib_rebalance(fi) do { } while (0)
> > +#define fib_add_weight(fi, nh) do { } while (0)
> > +
> > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
> >
> >  static int fib_encap_match(struct net *net, u16 encap_type,
> >                            struct nlattr *encap,
> > @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct
> > fib_config *cfg)
> >
> >         change_nexthops(fi) {
> >                 fib_info_update_nh_saddr(net, nexthop_nh);
> > +               fib_add_weight(fi, nexthop_nh);
> >         } endfor_nexthops(fi)
> >
> > +       fib_rebalance(fi);
> > +
> >  link_it:
> >         ofi = fib_find_info(fi);
> >         if (ofi) {
> > @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device
> > *dev, unsigned long event) nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
> >                                         break;
> >                                 }
> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -                               spin_lock_bh(&fib_multipath_lock);
> > -                               fi->fib_power -=
> > nexthop_nh->nh_power;
> > -                               nexthop_nh->nh_power = 0;
> > -                               spin_unlock_bh(&fib_multipath_lock);
> > -#endif
> >                                 dead++;
> >                         }
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev,
> > unsigned long event) }
> >                         ret++;
> >                 }
> > +
> > +               fib_rebalance(fi);
> >         }
> >
> >         return ret;
> > @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev,
> > unsigned int nh_flags) !__in_dev_get_rtnl(dev))
> >                                 continue;
> >                         alive++;
> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -                       spin_lock_bh(&fib_multipath_lock);
> > -                       nexthop_nh->nh_power = 0;
> > -                       nexthop_nh->nh_flags &= ~nh_flags;
> > -                       spin_unlock_bh(&fib_multipath_lock);
> > -#else
> >                         nexthop_nh->nh_flags &= ~nh_flags;
> > -#endif
> >                 } endfor_nexthops(fi)
> >
> >                 if (alive > 0) {
> >                         fi->fib_flags &= ~nh_flags;
> >                         ret++;
> >                 }
> > +
> > +               fib_rebalance(fi);
> >         }
> >
> >         return ret;
> > @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev,
> > unsigned int nh_flags)
> >
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >
> > -/*
> > - * The algorithm is suboptimal, but it provides really
> > - * fair weighted route distribution.
> > - */
> > -void fib_select_multipath(struct fib_result *res)
> > +void fib_select_multipath(struct fib_result *res, int hash)
> >  {
> >         struct fib_info *fi = res->fi;
> > -       struct in_device *in_dev;
> > -       int w;
> > -
> > -       spin_lock_bh(&fib_multipath_lock);
> > -       if (fi->fib_power <= 0) {
> > -               int power = 0;
> > -               change_nexthops(fi) {
> > -                       in_dev =
> > __in_dev_get_rcu(nexthop_nh->nh_dev);
> > -                       if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> > -                               continue;
> > -                       if (in_dev &&
> > -
> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> > -                           nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
> > -                               continue;
> > -                       power += nexthop_nh->nh_weight;
> > -                       nexthop_nh->nh_power =
> > nexthop_nh->nh_weight;
> > -               } endfor_nexthops(fi);
> > -               fi->fib_power = power;
> > -               if (power <= 0) {
> > -                       spin_unlock_bh(&fib_multipath_lock);
> > -                       /* Race condition: route has just become
> > dead. */
> > -                       res->nh_sel = 0;
> > -                       return;
> > -               }
> > -       }
> > -
> >
> > -       /* w should be random number [0..fi->fib_power-1],
> > -        * it is pretty bad approximation.
> > -        */
> > -
> > -       w = jiffies % fi->fib_power;
> > +       for_nexthops(fi) {
> > +               if (hash > atomic_read(&nh->nh_upper_bound))
> > +                       continue;
> >
> > -       change_nexthops(fi) {
> > -               if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) &&
> > -                   nexthop_nh->nh_power) {
> > -                       w -= nexthop_nh->nh_power;
> > -                       if (w <= 0) {
> > -                               nexthop_nh->nh_power--;
> > -                               fi->fib_power--;
> > -                               res->nh_sel = nhsel;
> > -                               spin_unlock_bh(&fib_multipath_lock);
> > -                               return;
> > -                       }
> > -               }
> > +               res->nh_sel = nhsel;
> > +               return;
> >         } endfor_nexthops(fi);
> >
> >         /* Race condition: route has just become dead. */
> >         res->nh_sel = 0;
> > -       spin_unlock_bh(&fib_multipath_lock);
> >  }
> >  #endif
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 80f7c5b..3dc83b6 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff
> > *skb, __be32 daddr, __be32 saddr, u32 tos)
> >  {
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -       if (res->fi && res->fi->fib_nhs > 1)
> > -               fib_select_multipath(res);
> > +       if (res->fi && res->fi->fib_nhs > 1) {
> > +               int h;
> > +
> > +               h = fib_multipath_hash(saddr, daddr);
> > +               fib_select_multipath(res, h);
> > +       }
> >  #endif
> >
> >         /* create a routing cache entry */
> > @@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct
> > net *net, struct flowi4 *fl4) }
> >
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> > -       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
> > -               fib_select_multipath(&res);
> > +       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
> > +               int h;
> > +
> > +               h = fib_multipath_hash(fl4->saddr, fl4->daddr);
> > +               fib_select_multipath(&res, h);
> > +       }
> >         else
> >  #endif
> >         if (!res.prefixlen &&
> > --
> > 1.7.10.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund
  2015-09-23 21:00   ` Tom Herbert
@ 2015-09-23 21:55   ` David Miller
  1 sibling, 0 replies; 16+ messages in thread
From: David Miller @ 2015-09-23 21:55 UTC (permalink / raw)
  To: pch; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber

From: Peter Nørlund <pch@ordbogen.com>
Date: Wed, 23 Sep 2015 21:49:36 +0200

> +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
> +{
> +	return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1;
> +}

Like Tom I'm curious why you do this shift here?


* Re: [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath
  2015-09-23 21:01   ` Tom Herbert
@ 2015-09-23 22:07     ` Peter Nørlund
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Nørlund @ 2015-09-23 22:07 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, 23 Sep 2015 14:01:23 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com>
> wrote:
> > ICMP packets are inspected to let them route together with the flow
> > they belong to, minimizing the chance that a problematic path will
> > affect flows on other paths, and so that anycast environments can
> > work with ECMP.
> >
> > Signed-off-by: Peter Nørlund <pch@ordbogen.com>
> > ---
> >  include/net/route.h |   11 +++++++++-
> >  net/ipv4/icmp.c     |   19 ++++++++++++++++-
> >  net/ipv4/route.c    |   57
> > +++++++++++++++++++++++++++++++++++++++++++++------ 3 files
> > changed, 79 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/net/route.h b/include/net/route.h
> > index 10a7d21..0f7abdc 100644
> > --- a/include/net/route.h
> > +++ b/include/net/route.h
> > @@ -28,6 +28,7 @@
> >  #include <net/inetpeer.h>
> >  #include <net/flow.h>
> >  #include <net/inet_sock.h>
> > +#include <net/ip_fib.h>
> >  #include <linux/in_route.h>
> >  #include <linux/rtnetlink.h>
> >  #include <linux/rcupdate.h>
> > @@ -112,7 +113,15 @@ struct in_device;
> >  int ip_rt_init(void);
> >  void rt_cache_flush(struct net *net);
> >  void rt_flush_dev(struct net_device *dev);
> > -struct rtable *__ip_route_output_key(struct net *, struct flowi4
> > *flp); +struct rtable *__ip_route_output_key_hash(struct net *,
> > struct flowi4 *flp,
> > +                                         int mp_hash);
> > +
> > +static inline struct rtable *__ip_route_output_key(struct net *net,
> > +                                                  struct flowi4
> > *flp) +{
> > +       return __ip_route_output_key_hash(net, flp, -1);
> > +}
> > +
> >  struct rtable *ip_route_output_flow(struct net *, struct flowi4
> > *flp, struct sock *sk);
> >  struct dst_entry *ipv4_blackhole_route(struct net *net,
> > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> > index 79fe05b..f81daca 100644
> > --- a/net/ipv4/icmp.c
> > +++ b/net/ipv4/icmp.c
> > @@ -440,6 +440,22 @@ out_unlock:
> >         icmp_xmit_unlock(sk);
> >  }
> >
> > +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > +
> > +/* Source and destination is swapped. See ip_multipath_icmp_hash */
> > +static int icmp_multipath_hash_skb(const struct sk_buff *skb)
> > +{
> > +       const struct iphdr *iph = ip_hdr(skb);
> > +
> > +       return fib_multipath_hash(iph->daddr, iph->saddr);
> > +}
> > +
> > +#else
> > +
> > +#define icmp_multipath_hash_skb(skb) (-1)
> > +
> > +#endif
> > +
> >  static struct rtable *icmp_route_lookup(struct net *net,
> >                                         struct flowi4 *fl4,
> >                                         struct sk_buff *skb_in,
> > @@ -464,7 +480,8 @@ static struct rtable *icmp_route_lookup(struct
> > net *net, fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev) ? :
> > skb_in->dev->ifindex;
> >
> >         security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
> > -       rt = __ip_route_output_key(net, fl4);
> > +       rt = __ip_route_output_key_hash(net, fl4,
> > +
> > icmp_multipath_hash_skb(skb_in)); if (IS_ERR(rt))
> >                 return rt;
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 3dc83b6..29048e1 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1652,6 +1652,48 @@ out:
> >         return err;
> >  }
> >
> > +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > +
> > +/* To make ICMP packets follow the right flow, the multipath hash
> > is
> > + * calculated from the inner IP addresses in reverse order.
> > + */
> > +static int ip_multipath_icmp_hash(struct sk_buff *skb)
> > +{
> 
> This seems like overkill to me. Even if you do this on the host no
> device in the network is going to crack open ICMP like this for doing
> its ECMP. I suppose the argument is that this is another hack needed
> to make anycast "work"...

True, this is only _really_ needed for anycast setups, although with
IPv4 it is solvable by disabling PMTU discovery on the anycast machines,
or by using Packetization Layer Path MTU Discovery. IPv6 is actually
more impacted by it (there is an IETF draft in the making,
draft-ietf-v6ops-pmtud-ecmp-problem-04).

I fully understand if the final verdict on the ICMP-handling is a
loud NO!, but my rationale for trying anyway is as follows:

1. The aforementioned IETF draft actually recommends this behavior where
possible.

2. Non-anycast setups may also benefit from it by truly isolating
which flows are affected by which paths. If one path was experiencing
packet loss or delays or whatever, it could affect the ICMP error
packets of the flows on the other paths.

3. I did consider making it a sysctl thing, but having it always on
doesn't break anything for anybody, yet it solves something for a few.

4. The alternative solutions I've seen today involve running a user
space daemon which receives the ICMP packets through NFLOG and then
broadcasts the packets to all paths. Compared to that, this is far more
efficient.

> 
> > +       const struct iphdr *outer_iph = ip_hdr(skb);
> > +       struct icmphdr _icmph;
> > +       const struct icmphdr *icmph;
> > +       struct iphdr _inner_iph;
> > +       const struct iphdr *inner_iph;
> > +
> > +       if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
> > +               goto standard_hash;
> > +
> > +       icmph = skb_header_pointer(skb, outer_iph->ihl * 4,
> > sizeof(_icmph),
> > +                                  &_icmph);
> > +       if (!icmph)
> > +               goto standard_hash;
> > +
> > +       if (icmph->type != ICMP_DEST_UNREACH &&
> > +           icmph->type != ICMP_REDIRECT &&
> > +           icmph->type != ICMP_TIME_EXCEEDED &&
> > +           icmph->type != ICMP_PARAMETERPROB) {
> > +               goto standard_hash;
> > +       }
> > +
> > +       inner_iph = skb_header_pointer(skb,
> > +                                      outer_iph->ihl * 4 + sizeof(_icmph),
> > +                                      sizeof(_inner_iph), &_inner_iph);
> > +       if (!inner_iph)
> > +               goto standard_hash;
> > +
> > +       return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
> > +
> > +standard_hash:
> > +       return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
> > +}
> > +
> > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
> > +
> >  static int ip_mkroute_input(struct sk_buff *skb,
> >                             struct fib_result *res,
> >                             const struct flowi4 *fl4,
> > @@ -1662,7 +1704,10 @@ static int ip_mkroute_input(struct sk_buff *skb,
> >         if (res->fi && res->fi->fib_nhs > 1) {
> >                 int h;
> >
> > -               h = fib_multipath_hash(saddr, daddr);
> > +               if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP))
> > +                       h = ip_multipath_icmp_hash(skb);
> > +               else
> > +                       h = fib_multipath_hash(saddr, daddr);
> >                 fib_select_multipath(res, h);
> >         }
> >  #endif
> > @@ -2032,7 +2077,8 @@ add:
> >   * Major route resolver routine.
> >   */
> >
> > -struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
> > +struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
> > +                                         int mp_hash)
> >  {
> >         struct net_device *dev_out = NULL;
> >         __u8 tos = RT_FL_TOS(fl4);
> > @@ -2195,10 +2241,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
> >
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >         if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
> > -               int h;
> > -
> > -               h = fib_multipath_hash(fl4->saddr, fl4->daddr);
> > -               fib_select_multipath(&res, h);
> > +               if (mp_hash < 0)
> > +                       mp_hash = fib_multipath_hash(fl4->saddr, fl4->daddr);
> > +               fib_select_multipath(&res, mp_hash);
> >         }
> >         else
> >  #endif
> > --
> > 1.7.10.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 21:00   ` Tom Herbert
  2015-09-23 21:43     ` Peter Nørlund
@ 2015-09-23 23:09     ` Peter Nørlund
  2015-09-23 23:37       ` Tom Herbert
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Nørlund @ 2015-09-23 23:09 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, 23 Sep 2015 14:00:43 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com>
> wrote:
> > Replaces the per-packet multipath with a hash-based multipath using
> > source and destination address.
> >
> It's good that round robin is going away, but this still looks very
> different with how multipath routing is done done in IPv6
> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4
> hashes addresses, but IPv6 includes ports. How can we rectify this?
> 

I may be wrong, since I haven't delved that much into the IPv6 code, but
rt6_multipath_select is nice and clean because it isn't burdened with
the different weights of the paths.

As for not including the ports, it is for the sole purpose of not
disrupting flows when fragmented packets are received. Fragmentation is
more likely with IPv4 than with IPv6, since PMTUD is optional with IPv4. In
an ideal world, the IPv6 code shouldn't look at anything but addresses
and flow label either, based on the principle that the router shouldn't
care about L4 and above (but then it shouldn't look at ICMP either, heh)
- but I know this isn't an ideal world and I have no operational
experience with IPv6, so I can't tell whether clients populate the flow
label properly.
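
To make the fragmentation point concrete, a rough user-space sketch
(illustrative only, not kernel code): only the first fragment carries
the L4 header, so an L4 hash can scatter fragments of one flow across
paths, while an L3 hash cannot.

#include <stdint.h>
#include <stdbool.h>

struct pkt {
	uint32_t saddr, daddr;
	uint16_t sport, dport;	/* only meaningful when has_l4 is true */
	bool     has_l4;	/* false for non-first fragments       */
};

static uint32_t hash_l3(const struct pkt *p)
{
	return p->saddr ^ p->daddr;	/* toy hash, same for every fragment */
}

static uint32_t hash_l4(const struct pkt *p)
{
	uint32_t h = p->saddr ^ p->daddr;

	if (p->has_l4)		/* later fragments lack ports,  */
		h ^= ((uint32_t)p->sport << 16) | p->dport; /* so they differ */
	return h;
}

With hash_l3() every fragment of a flow maps to the same nexthop; with
hash_l4() the first fragment and the remaining fragments can end up on
different paths, which is exactly the disruption described above.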

I would argue that L3-based hashing is more than sufficient for
most websites and ISPs, where the number of addresses is high. At least
on the network I have access to, L4 gave very little extra (3%). But I
knew Linux users would demand L4 hashing despite my beliefs, and
there would probably even be people missing the per-packet multipath.
This is why I started out reintroducing the RTA_MP_ALGO attribute in
my original patch.

To be honest, L4 might almost work in my network, which hosts a
few relatively large Danish websites. Fragmentation is only a problem for
clients not doing PMTUD (~10%) that also send large HTTP cookies (very
few). But those people would have a 50% chance of not being able to access
our sites at all, because packets would be distributed to a load balancer
which has not been updated with the connection state yet.

My goal is to create the right solution, and to me the right solution
is one which doesn't break anything whatsoever. It doesn't cause
out-of-order packets or lost packets just to utilize some links
better. ECMP, Link Aggregation, anycast and load balancers are all
hacks, if you ask me - and these hacks must be careful to not destroy
the illusion that an IP address maps to a single host and the path to
that host is through one cable.

If you all disagree, I'll change it - no problem. Just about anything
is better than the per-packet solution. But I'll have to consider
whether we will be running a modified version of the multipath code in
my network.

Best regards,
 Peter Nørlund

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 21:43     ` Peter Nørlund
@ 2015-09-23 23:12       ` Tom Herbert
  2015-09-23 23:38         ` Peter Nørlund
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Herbert @ 2015-09-23 23:12 UTC (permalink / raw)
  To: Peter Nørlund
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, Sep 23, 2015 at 2:43 PM, Peter Nørlund <pch@ordbogen.com> wrote:
> On Wed, 23 Sep 2015 14:00:43 -0700
> Tom Herbert <tom@herbertland.com> wrote:
>
>> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com>
>> wrote:
>> > Replaces the per-packet multipath with a hash-based multipath using
>> > source and destination address.
>> >
>> It's good that round robin is going away, but this still looks very
>> different with how multipath routing is done done in IPv6
>> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4
>> hashes addresses, but IPv6 includes ports. How can we rectify this?
>>
>> > Signed-off-by: Peter Nørlund <pch@ordbogen.com>
>> > ---
>> >  include/net/ip_fib.h     |   14 ++++-
> >> >  net/ipv4/fib_semantics.c |  140 +++++++++++++++++++++++++---------------------
> >> >  net/ipv4/route.c         |   16 ++++--
> >> >  3 files changed, 98 insertions(+), 72 deletions(-)
>> >
>> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
>> > index a37d043..c454203 100644
>> > --- a/include/net/ip_fib.h
>> > +++ b/include/net/ip_fib.h
>> > @@ -79,7 +79,7 @@ struct fib_nh {
>> >         unsigned char           nh_scope;
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> >         int                     nh_weight;
>> > -       int                     nh_power;
>> > +       atomic_t                nh_upper_bound;
>> >  #endif
>> >  #ifdef CONFIG_IP_ROUTE_CLASSID
>> >         __u32                   nh_tclassid;
>> > @@ -118,7 +118,7 @@ struct fib_info {
>> >  #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
>> >         int                     fib_nhs;
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > -       int                     fib_power;
>> > +       int                     fib_weight;
>> >  #endif
>> >         struct rcu_head         rcu;
>> >         struct fib_nh           fib_nh[0];
>> > @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct
>> > net_device *dev); int fib_sync_down_dev(struct net_device *dev,
>> > unsigned long event); int fib_sync_down_addr(struct net *net,
>> > __be32 local); int fib_sync_up(struct net_device *dev, unsigned int
>> > nh_flags); -void fib_select_multipath(struct fib_result *res);
>> > +
>> > +extern u32 fib_multipath_secret __read_mostly;
>> > +
>> > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
>> > +{
>> > +       return jhash_2words(saddr, daddr, fib_multipath_secret) >>
>> > 1;
>>
>> Why the >> 1?
>
> Each nexthop is assigned a range within the hash key space with the
> upper bound specified in nh_upper_bound, but I needed a way to show
> that a nexthop was dead within that same atomic variable. I chose -1
> reducing the actual hash to 31 bits.  I shift the hash because I expect
> it to be slightly faster than AND'ing with 0x7FFFFFFF on most platforms.
>
>>
>> > +}
>> > +
>> > +void fib_select_multipath(struct fib_result *res, int hash);
>> >
>> >  /* Exported by fib_trie.c */
>> >  void fib_trie_init(void);
>> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> > index 064bd3c..0c49d2f 100644
>> > --- a/net/ipv4/fib_semantics.c
>> > +++ b/net/ipv4/fib_semantics.c
>> > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
>> >  static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
>> >
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > -
>> > -static DEFINE_SPINLOCK(fib_multipath_lock);
>> > +u32 fib_multipath_secret __read_mostly;
>> >
>> >  #define for_nexthops(fi)
>> > {                                             \ int nhsel; const
>> > struct fib_nh *nh;                             \ @@ -532,7 +531,67
>> > @@ errout: return ret;
>> >  }
>> >
>> > -#endif
>> > +static void fib_rebalance(struct fib_info *fi)
>> > +{
>> > +       int total;
>> > +       int w;
>> > +       struct in_device *in_dev;
>> > +
>> > +       if (fi->fib_nhs < 2)
>> > +               return;
>> > +
>> > +       total = 0;
>> > +       for_nexthops(fi) {
>> > +               if (nh->nh_flags & RTNH_F_DEAD)
>> > +                       continue;
>> > +
>> > +               in_dev = __in_dev_get_rcu(nh->nh_dev);
>> > +
>> > +               if (in_dev &&
>> > +                   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
>> > +                   nh->nh_flags & RTNH_F_LINKDOWN)
>> > +                       continue;
>> > +
>> > +               total += nh->nh_weight;
>> > +       } endfor_nexthops(fi);
>> > +
>> > +       w = 0;
>> > +       change_nexthops(fi) {
>> > +               int upper_bound;
>> > +
>> > +               in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
>> > +
>> > +               if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
>> > +                       upper_bound = -1;
>> > +               } else if (in_dev &&
>> > +
>> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
>> > +                          nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
>> > +                       upper_bound = -1;
>> > +               } else {
>> > +                       w += nexthop_nh->nh_weight;
>> > +                       upper_bound =
>> > DIV_ROUND_CLOSEST(2147483648LL * w,
>> > +                                                       total) - 1;
>> > +               }
>> > +
>>
>> Is this complexity (aside from recomputing the key) needed because you
>> want consistent hashing for the anycast case?
>
> It is an attempt to reduce the disruption when a path is added or
> removed in a way recommended by RFC 2992. It is not specific to anycast
> at all, although disruptions are more severe for anycast than for
> ordinary TCP which might just slow down due to out of order packets.
>
RFC2992 looks more like a mathematical proof than a practical
algorithm :-) Anyway, the requirement to avoid OOO with TCP is so
often overblown. TCP does not keel over dead with OOO packets, it
just may experience a blip of being suboptimal. As long as we're not
thrashing paths things should be fine quickly, and if we are thrashing
links, then whether each change affects 1/2 or 1/4 of connections
probably doesn't matter. IMO no one should complain about just doing
simple modulo on the hash.

However, if you're saying that adding or removing a path can cause
complete loss of connectivity for 1/2 of connections using anycast,
that is a little more worrisome (this is essentially the same problem
we have with SO_REUSEPORT and TCP). In that case it seems like you'd
want the 1/N solution.
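
For reference, a quick user-space experiment contrasting plain modulo
with recomputed RFC 2992-style thresholds (illustrative only):

#include <stdio.h>
#include <stdint.h>

#define SPACE (1u << 31)	/* 31-bit hash space, as in the patch */

static int by_modulo(uint32_t h, int n)    { return h % n; }
static int by_threshold(uint32_t h, int n) { return (int)(((uint64_t)h * n) >> 31); }

int main(void)
{
	unsigned long moved_mod = 0, moved_thr = 0, samples = 0;
	uint32_t h;

	/* Walk the hash space and count flows that pick a different
	 * nexthop when a 4th equal-weight path is added to 3 existing ones.
	 */
	for (h = 0; h < SPACE; h += 2048) {
		samples++;
		if (by_modulo(h, 3) != by_modulo(h, 4))
			moved_mod++;
		if (by_threshold(h, 3) != by_threshold(h, 4))
			moved_thr++;
	}

	/* Prints roughly 75% for modulo and 50% for thresholds here; per
	 * the thread's numbers, thresholds stay within 1/4..1/2, while an
	 * ideal consistent mapping would move only 1/N of the flows.
	 */
	printf("modulo: %.1f%%  threshold: %.1f%%\n",
	       100.0 * moved_mod / samples, 100.0 * moved_thr / samples);
	return 0;
}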

Tom

> Traditionally people used modulo to calculate the path from the hash,
> but with the modulo approach 1/2 of all flows would change paths when a
> path is added or removed. With the hash threshold approach, somewhere
> between 1/4 and 1/2 of all flows are affected. So an improvement for
> routes with more than two paths.
>
> Ideally you would have a hash-to-port mapping table or similar,
> ensuring that only 1/N of all flows were affected (with N being the
> number of paths).
>
>>
>> > +               atomic_set(&nexthop_nh->nh_upper_bound,
>> > upper_bound);
>> > +       } endfor_nexthops(fi);
>> > +
>> > +       net_get_random_once(&fib_multipath_secret,
>> > +                           sizeof(fib_multipath_secret));
>> > +}
>> > +
>> > +static inline void fib_add_weight(struct fib_info *fi,
>> > +                                 const struct fib_nh *nh)
>> > +{
>> > +       fi->fib_weight += nh->nh_weight;
>> > +}
>> > +
>> > +#else /* CONFIG_IP_ROUTE_MULTIPATH */
>> > +
>> > +#define fib_rebalance(fi) do { } while (0)
>> > +#define fib_add_weight(fi, nh) do { } while (0)
>> > +
>> > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
>> >
>> >  static int fib_encap_match(struct net *net, u16 encap_type,
>> >                            struct nlattr *encap,
>> > @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct
>> > fib_config *cfg)
>> >
>> >         change_nexthops(fi) {
>> >                 fib_info_update_nh_saddr(net, nexthop_nh);
>> > +               fib_add_weight(fi, nexthop_nh);
>> >         } endfor_nexthops(fi)
>> >
>> > +       fib_rebalance(fi);
>> > +
>> >  link_it:
>> >         ofi = fib_find_info(fi);
>> >         if (ofi) {
>> > @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device
>> > *dev, unsigned long event) nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
>> >                                         break;
>> >                                 }
>> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > -                               spin_lock_bh(&fib_multipath_lock);
>> > -                               fi->fib_power -=
>> > nexthop_nh->nh_power;
>> > -                               nexthop_nh->nh_power = 0;
>> > -                               spin_unlock_bh(&fib_multipath_lock);
>> > -#endif
>> >                                 dead++;
>> >                         }
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device *dev,
>> > unsigned long event) }
>> >                         ret++;
>> >                 }
>> > +
>> > +               fib_rebalance(fi);
>> >         }
>> >
>> >         return ret;
>> > @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev,
>> > unsigned int nh_flags) !__in_dev_get_rtnl(dev))
>> >                                 continue;
>> >                         alive++;
>> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > -                       spin_lock_bh(&fib_multipath_lock);
>> > -                       nexthop_nh->nh_power = 0;
>> > -                       nexthop_nh->nh_flags &= ~nh_flags;
>> > -                       spin_unlock_bh(&fib_multipath_lock);
>> > -#else
>> >                         nexthop_nh->nh_flags &= ~nh_flags;
>> > -#endif
>> >                 } endfor_nexthops(fi)
>> >
>> >                 if (alive > 0) {
>> >                         fi->fib_flags &= ~nh_flags;
>> >                         ret++;
>> >                 }
>> > +
>> > +               fib_rebalance(fi);
>> >         }
>> >
>> >         return ret;
>> > @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev,
>> > unsigned int nh_flags)
>> >
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> >
>> > -/*
>> > - * The algorithm is suboptimal, but it provides really
>> > - * fair weighted route distribution.
>> > - */
>> > -void fib_select_multipath(struct fib_result *res)
>> > +void fib_select_multipath(struct fib_result *res, int hash)
>> >  {
>> >         struct fib_info *fi = res->fi;
>> > -       struct in_device *in_dev;
>> > -       int w;
>> > -
>> > -       spin_lock_bh(&fib_multipath_lock);
>> > -       if (fi->fib_power <= 0) {
>> > -               int power = 0;
>> > -               change_nexthops(fi) {
>> > -                       in_dev =
>> > __in_dev_get_rcu(nexthop_nh->nh_dev);
>> > -                       if (nexthop_nh->nh_flags & RTNH_F_DEAD)
>> > -                               continue;
>> > -                       if (in_dev &&
>> > -
>> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
>> > -                           nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
>> > -                               continue;
>> > -                       power += nexthop_nh->nh_weight;
>> > -                       nexthop_nh->nh_power =
>> > nexthop_nh->nh_weight;
>> > -               } endfor_nexthops(fi);
>> > -               fi->fib_power = power;
>> > -               if (power <= 0) {
>> > -                       spin_unlock_bh(&fib_multipath_lock);
>> > -                       /* Race condition: route has just become
>> > dead. */
>> > -                       res->nh_sel = 0;
>> > -                       return;
>> > -               }
>> > -       }
>> > -
>> >
>> > -       /* w should be random number [0..fi->fib_power-1],
>> > -        * it is pretty bad approximation.
>> > -        */
>> > -
>> > -       w = jiffies % fi->fib_power;
>> > +       for_nexthops(fi) {
>> > +               if (hash > atomic_read(&nh->nh_upper_bound))
>> > +                       continue;
>> >
>> > -       change_nexthops(fi) {
>> > -               if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) &&
>> > -                   nexthop_nh->nh_power) {
>> > -                       w -= nexthop_nh->nh_power;
>> > -                       if (w <= 0) {
>> > -                               nexthop_nh->nh_power--;
>> > -                               fi->fib_power--;
>> > -                               res->nh_sel = nhsel;
>> > -                               spin_unlock_bh(&fib_multipath_lock);
>> > -                               return;
>> > -                       }
>> > -               }
>> > +               res->nh_sel = nhsel;
>> > +               return;
>> >         } endfor_nexthops(fi);
>> >
>> >         /* Race condition: route has just become dead. */
>> >         res->nh_sel = 0;
>> > -       spin_unlock_bh(&fib_multipath_lock);
>> >  }
>> >  #endif
>> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>> > index 80f7c5b..3dc83b6 100644
>> > --- a/net/ipv4/route.c
>> > +++ b/net/ipv4/route.c
>> > @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff
>> > *skb, __be32 daddr, __be32 saddr, u32 tos)
>> >  {
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > -       if (res->fi && res->fi->fib_nhs > 1)
>> > -               fib_select_multipath(res);
>> > +       if (res->fi && res->fi->fib_nhs > 1) {
>> > +               int h;
>> > +
>> > +               h = fib_multipath_hash(saddr, daddr);
>> > +               fib_select_multipath(res, h);
>> > +       }
>> >  #endif
>> >
>> >         /* create a routing cache entry */
>> > @@ -2190,8 +2194,12 @@ struct rtable *__ip_route_output_key(struct
>> > net *net, struct flowi4 *fl4) }
>> >
>> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> > -       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
>> > -               fib_select_multipath(&res);
>> > +       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
>> > +               int h;
>> > +
>> > +               h = fib_multipath_hash(fl4->saddr, fl4->daddr);
>> > +               fib_select_multipath(&res, h);
>> > +       }
>> >         else
>> >  #endif
>> >         if (!res.prefixlen &&
>> > --
>> > 1.7.10.4
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 23:09     ` Peter Nørlund
@ 2015-09-23 23:37       ` Tom Herbert
  0 siblings, 0 replies; 16+ messages in thread
From: Tom Herbert @ 2015-09-23 23:37 UTC (permalink / raw)
  To: Peter Nørlund
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, Sep 23, 2015 at 4:09 PM, Peter Nørlund <pch@ordbogen.com> wrote:
> On Wed, 23 Sep 2015 14:00:43 -0700
> Tom Herbert <tom@herbertland.com> wrote:
>
>> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com>
>> wrote:
>> > Replaces the per-packet multipath with a hash-based multipath using
>> > source and destination address.
>> >
>> It's good that round robin is going away, but this still looks very
>> different with how multipath routing is done done in IPv6
>> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4
>> hashes addresses, but IPv6 includes ports. How can we rectify this?
>>
>
> I may be wrong, since I haven't delved that much into the IPv6 code, but
> rt6_multipath_select is nice and clean because it doesn't have to burden
> with different weights of the paths.
>
> As for not including the ports, it is for the sole purpose of not
> disruption the flow when fragmented packets are received. This is more
> likely with IPv4 than with IPv6, since PMTUD is optional with IPv4. In
> an ideal world, the IPv6 code shouldn't look at anything but addresses
> and flow label either, based on the principle that the router shouldn't
> care about L4 and above (but then it shouldn't look at ICMP either, heh)
> - but I know this isn't an ideal world and I have no operational
> experience with IPv6, so I can't tell whether clients populate the flow
> label properly.
>
> I would argue that L3-based hashing is more than sufficient for
> most websites and ISPs, where the number of addresses is high. At least
> on the network I have access to, L4 gave very little extra (3%). But I
> knew linux users would be demanding L4 hashing despite my beliefs, and
> there would probably even be people missing the per-packet multipath.
> This is why I started out reintroducing the RTA_MP_ALGO attribute in
> my original patch.
>
L4 versus L3 hashing is not my primary concern. It is the glaring
inconsistency between IPv4 and IPv6. If we have fundamentally
different behaviors between these versions of IP, this can create (and
has created) headaches for users running IPv4 and IPv6 networks (which
is now basically the Internet). All of your points for why L3 is
better than L4 hashing for IPv4 should apply to IPv6 given the current
state of flow label support. So why not try for a unified solution?
Either both should just use L3 hashing, or both should allow
configurable use of L3 or L4 hashing.

Thanks,
Tom

> To be honest, L4 might almost work in my network which hosts a
> few relatively large Danish websites. Fragmentation is only a problem on
> clients not doing PMTU (~10%) having large HTTP cookies (very few). But
> to these people, they'll have a 50% chance of not being able to access
> our sites at all, because packets are distributed to load balancers
> which have not been updated with the connection state yet.
>
> My goal is to create the right solution, and to me the right solution
> is a solution which doesn't break anything whatsoever. It doesn't cause
> out-of-order packets or lost packets just to utilize some links
> better. ECMP, Link Aggregation, anycast and load balancers are all
> hacks, if you ask me - and these hacks must be careful to not destroy
> the illusion that an IP address maps to a single host and the path to
> that host is through one cable.
>
> If you all disagree, I'll change it - no problem. Just about anything
> is better than the per-packet solution. But I'll have to consider
> whether we will be running a modified version of the multipath code in
> my network.
>
> Best regards,
>  Peter Nørlund

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath
  2015-09-23 23:12       ` Tom Herbert
@ 2015-09-23 23:38         ` Peter Nørlund
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Nørlund @ 2015-09-23 23:38 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy

On Wed, 23 Sep 2015 16:12:42 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Sep 23, 2015 at 2:43 PM, Peter Nørlund <pch@ordbogen.com>
> wrote:
> > On Wed, 23 Sep 2015 14:00:43 -0700
> > Tom Herbert <tom@herbertland.com> wrote:
> >
> >> On Wed, Sep 23, 2015 at 12:49 PM, Peter Nørlund <pch@ordbogen.com>
> >> wrote:
> >> > Replaces the per-packet multipath with a hash-based multipath
> >> > using source and destination address.
> >> >
> >> It's good that round robin is going away, but this still looks very
> >> different with how multipath routing is done done in IPv6
> >> (rt6_multipath_select and rt6_info_hash_nhsfn). For instance IPv4
> >> hashes addresses, but IPv6 includes ports. How can we rectify this?
> >>
> >> > Signed-off-by: Peter Nørlund <pch@ordbogen.com>
> >> > ---
> >> >  include/net/ip_fib.h     |   14 ++++-
> >> >  net/ipv4/fib_semantics.c |  140
> >> > +++++++++++++++++++++++++---------------------
> >> > net/ipv4/route.c         |   16 ++++-- 3 files changed, 98
> >> > insertions(+), 72 deletions(-)
> >> >
> >> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> >> > index a37d043..c454203 100644
> >> > --- a/include/net/ip_fib.h
> >> > +++ b/include/net/ip_fib.h
> >> > @@ -79,7 +79,7 @@ struct fib_nh {
> >> >         unsigned char           nh_scope;
> >> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> >         int                     nh_weight;
> >> > -       int                     nh_power;
> >> > +       atomic_t                nh_upper_bound;
> >> >  #endif
> >> >  #ifdef CONFIG_IP_ROUTE_CLASSID
> >> >         __u32                   nh_tclassid;
> >> > @@ -118,7 +118,7 @@ struct fib_info {
> >> >  #define fib_advmss fib_metrics[RTAX_ADVMSS-1]
> >> >         int                     fib_nhs;
> >> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> > -       int                     fib_power;
> >> > +       int                     fib_weight;
> >> >  #endif
> >> >         struct rcu_head         rcu;
> >> >         struct fib_nh           fib_nh[0];
> >> > @@ -312,7 +312,15 @@ int ip_fib_check_default(__be32 gw, struct
> >> > net_device *dev); int fib_sync_down_dev(struct net_device *dev,
> >> > unsigned long event); int fib_sync_down_addr(struct net *net,
> >> > __be32 local); int fib_sync_up(struct net_device *dev, unsigned
> >> > int nh_flags); -void fib_select_multipath(struct fib_result
> >> > *res); +
> >> > +extern u32 fib_multipath_secret __read_mostly;
> >> > +
> >> > +static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
> >> > +{
> >> > +       return jhash_2words(saddr, daddr, fib_multipath_secret)
> >> > >> 1;
> >>
> >> Why the >> 1?
> >
> > Each nexthop is assigned a range within the hash key space with the
> > upper bound specified in nh_upper_bound, but I needed a way to show
> > that a nexthop was dead within that same atomic variable. I chose -1
> > reducing the actual hash to 31 bits.  I shift the hash because I
> > expect it to be slightly faster than AND'ing with 0x7FFFFFFF on
> > most platforms.
> >
> >>
> >> > +}
> >> > +
> >> > +void fib_select_multipath(struct fib_result *res, int hash);
> >> >
> >> >  /* Exported by fib_trie.c */
> >> >  void fib_trie_init(void);
> >> > diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> >> > index 064bd3c..0c49d2f 100644
> >> > --- a/net/ipv4/fib_semantics.c
> >> > +++ b/net/ipv4/fib_semantics.c
> >> > @@ -57,8 +57,7 @@ static unsigned int fib_info_cnt;
> >> >  static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
> >> >
> >> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> > -
> >> > -static DEFINE_SPINLOCK(fib_multipath_lock);
> >> > +u32 fib_multipath_secret __read_mostly;
> >> >
> >> >  #define for_nexthops(fi)
> >> > {                                             \ int nhsel; const
> >> > struct fib_nh *nh;                             \ @@ -532,7
> >> > +531,67 @@ errout: return ret;
> >> >  }
> >> >
> >> > -#endif
> >> > +static void fib_rebalance(struct fib_info *fi)
> >> > +{
> >> > +       int total;
> >> > +       int w;
> >> > +       struct in_device *in_dev;
> >> > +
> >> > +       if (fi->fib_nhs < 2)
> >> > +               return;
> >> > +
> >> > +       total = 0;
> >> > +       for_nexthops(fi) {
> >> > +               if (nh->nh_flags & RTNH_F_DEAD)
> >> > +                       continue;
> >> > +
> >> > +               in_dev = __in_dev_get_rcu(nh->nh_dev);
> >> > +
> >> > +               if (in_dev &&
> >> > +                   IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> >> > +                   nh->nh_flags & RTNH_F_LINKDOWN)
> >> > +                       continue;
> >> > +
> >> > +               total += nh->nh_weight;
> >> > +       } endfor_nexthops(fi);
> >> > +
> >> > +       w = 0;
> >> > +       change_nexthops(fi) {
> >> > +               int upper_bound;
> >> > +
> >> > +               in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> >> > +
> >> > +               if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
> >> > +                       upper_bound = -1;
> >> > +               } else if (in_dev &&
> >> > +
> >> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> >> > +                          nexthop_nh->nh_flags &
> >> > RTNH_F_LINKDOWN) {
> >> > +                       upper_bound = -1;
> >> > +               } else {
> >> > +                       w += nexthop_nh->nh_weight;
> >> > +                       upper_bound =
> >> > DIV_ROUND_CLOSEST(2147483648LL * w,
> >> > +                                                       total) -
> >> > 1;
> >> > +               }
> >> > +
> >>
> >> Is this complexity (aside from recomputing the key) needed because
> >> you want consistent hashing for the anycast case?
> >
> > It is an attempt to reduce the disruption when a path is added or
> > removed in a way recommended by RFC 2992. It is not specific to
> > anycast at all, although disruptions are more severe for anycast
> > than for ordinary TCP which might just slow down due to out of
> > order packets.
> >
> RFC2992 looks more like a mathematical proof than a practical
> algorithm :-) Anyway, the requirement to avoid OOO with TCP is so
> often over blown. TCP does not keel over dead with OOO packets, it
> just may experience a blip of being suboptimal. As long as we're not
> thrashing paths things should be fine quickly, and if we are thrashing
> links rather each change affects 1/2 or 1/4 of connections probably
> doesn't matter. IMO no one should complain about just doing simple
> modulo on the hash.
> 
> However, if you're saying that adding a removing a path can cause
> complete loss of connectivity for 1/2 of connections using anycast
> that is a little more worrisome (this is essentially the same problem
> we have with SO_REUSEPORT and TCP). In that case seems like you'd want
> the 1/N solution.
> 
> Tom
> 

I guess it depends on how you use anycast. If you use TCP and each path
would hit different servers which don't share state at all, things
would break. If you have a pair of load balancers behind the
router which synchronize with each other (like IPVS), then only the
very new connections would fail (the connections which haven't been
synchronized yet).

If you only have two paths, then you can't get any better than 1/2 no
matter what algorithm you choose (all flows on the bad path will have
to move, which may cause issues depending on the setup).

It is actually still better than IPVS in an active/backup scenario,
since a crashing load balancer might cause issues for ALL flows.

I guess we are getting closer and closer to L4 with modulo after all -
at least it is simpler.

Best regards,
 Peter Nørlund

> > Traditionally people used modulo to calculate the path from the
> > hash, but with the modulo approach 1/2 of all flows would change
> > paths when a path is added or removed. With the hash threshold
> > approach, somewhere between 1/4 and 1/2 of all flows are affected.
> > So an improvement for routes with more than two paths.
> >
> > Ideally you would have a hash-to-port mapping table or similar,
> > ensuring that only 1/N of all flows were affected (with N being the
> > number of paths).
> >
> >>
> >> > +               atomic_set(&nexthop_nh->nh_upper_bound,
> >> > upper_bound);
> >> > +       } endfor_nexthops(fi);
> >> > +
> >> > +       net_get_random_once(&fib_multipath_secret,
> >> > +                           sizeof(fib_multipath_secret));
> >> > +}
> >> > +
> >> > +static inline void fib_add_weight(struct fib_info *fi,
> >> > +                                 const struct fib_nh *nh)
> >> > +{
> >> > +       fi->fib_weight += nh->nh_weight;
> >> > +}
> >> > +
> >> > +#else /* CONFIG_IP_ROUTE_MULTIPATH */
> >> > +
> >> > +#define fib_rebalance(fi) do { } while (0)
> >> > +#define fib_add_weight(fi, nh) do { } while (0)
> >> > +
> >> > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
> >> >
> >> >  static int fib_encap_match(struct net *net, u16 encap_type,
> >> >                            struct nlattr *encap,
> >> > @@ -1094,8 +1153,11 @@ struct fib_info *fib_create_info(struct
> >> > fib_config *cfg)
> >> >
> >> >         change_nexthops(fi) {
> >> >                 fib_info_update_nh_saddr(net, nexthop_nh);
> >> > +               fib_add_weight(fi, nexthop_nh);
> >> >         } endfor_nexthops(fi)
> >> >
> >> > +       fib_rebalance(fi);
> >> > +
> >> >  link_it:
> >> >         ofi = fib_find_info(fi);
> >> >         if (ofi) {
> >> > @@ -1317,12 +1379,6 @@ int fib_sync_down_dev(struct net_device
> >> > *dev, unsigned long event) nexthop_nh->nh_flags |=
> >> > RTNH_F_LINKDOWN; break;
> >> >                                 }
> >> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> > -
> >> > spin_lock_bh(&fib_multipath_lock);
> >> > -                               fi->fib_power -=
> >> > nexthop_nh->nh_power;
> >> > -                               nexthop_nh->nh_power = 0;
> >> > -
> >> > spin_unlock_bh(&fib_multipath_lock); -#endif
> >> >                                 dead++;
> >> >                         }
> >> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> > @@ -1345,6 +1401,8 @@ int fib_sync_down_dev(struct net_device
> >> > *dev, unsigned long event) }
> >> >                         ret++;
> >> >                 }
> >> > +
> >> > +               fib_rebalance(fi);
> >> >         }
> >> >
> >> >         return ret;
> >> > @@ -1467,20 +1525,15 @@ int fib_sync_up(struct net_device *dev,
> >> > unsigned int nh_flags) !__in_dev_get_rtnl(dev))
> >> >                                 continue;
> >> >                         alive++;
> >> > -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> > -                       spin_lock_bh(&fib_multipath_lock);
> >> > -                       nexthop_nh->nh_power = 0;
> >> > -                       nexthop_nh->nh_flags &= ~nh_flags;
> >> > -                       spin_unlock_bh(&fib_multipath_lock);
> >> > -#else
> >> >                         nexthop_nh->nh_flags &= ~nh_flags;
> >> > -#endif
> >> >                 } endfor_nexthops(fi)
> >> >
> >> >                 if (alive > 0) {
> >> >                         fi->fib_flags &= ~nh_flags;
> >> >                         ret++;
> >> >                 }
> >> > +
> >> > +               fib_rebalance(fi);
> >> >         }
> >> >
> >> >         return ret;
> >> > @@ -1488,62 +1541,19 @@ int fib_sync_up(struct net_device *dev,
> >> > unsigned int nh_flags)
> >> >
> >> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> >
> >> > -/*
> >> > - * The algorithm is suboptimal, but it provides really
> >> > - * fair weighted route distribution.
> >> > - */
> >> > -void fib_select_multipath(struct fib_result *res)
> >> > +void fib_select_multipath(struct fib_result *res, int hash)
> >> >  {
> >> >         struct fib_info *fi = res->fi;
> >> > -       struct in_device *in_dev;
> >> > -       int w;
> >> > -
> >> > -       spin_lock_bh(&fib_multipath_lock);
> >> > -       if (fi->fib_power <= 0) {
> >> > -               int power = 0;
> >> > -               change_nexthops(fi) {
> >> > -                       in_dev =
> >> > __in_dev_get_rcu(nexthop_nh->nh_dev);
> >> > -                       if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> >> > -                               continue;
> >> > -                       if (in_dev &&
> >> > -
> >> > IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> >> > -                           nexthop_nh->nh_flags &
> >> > RTNH_F_LINKDOWN)
> >> > -                               continue;
> >> > -                       power += nexthop_nh->nh_weight;
> >> > -                       nexthop_nh->nh_power =
> >> > nexthop_nh->nh_weight;
> >> > -               } endfor_nexthops(fi);
> >> > -               fi->fib_power = power;
> >> > -               if (power <= 0) {
> >> > -                       spin_unlock_bh(&fib_multipath_lock);
> >> > -                       /* Race condition: route has just become
> >> > dead. */
> >> > -                       res->nh_sel = 0;
> >> > -                       return;
> >> > -               }
> >> > -       }
> >> > -
> >> >
> >> > -       /* w should be random number [0..fi->fib_power-1],
> >> > -        * it is pretty bad approximation.
> >> > -        */
> >> > -
> >> > -       w = jiffies % fi->fib_power;
> >> > +       for_nexthops(fi) {
> >> > +               if (hash > atomic_read(&nh->nh_upper_bound))
> >> > +                       continue;
> >> >
> >> > -       change_nexthops(fi) {
> >> > -               if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) &&
> >> > -                   nexthop_nh->nh_power) {
> >> > -                       w -= nexthop_nh->nh_power;
> >> > -                       if (w <= 0) {
> >> > -                               nexthop_nh->nh_power--;
> >> > -                               fi->fib_power--;
> >> > -                               res->nh_sel = nhsel;
> >> > -
> >> > spin_unlock_bh(&fib_multipath_lock);
> >> > -                               return;
> >> > -                       }
> >> > -               }
> >> > +               res->nh_sel = nhsel;
> >> > +               return;
> >> >         } endfor_nexthops(fi);
> >> >
> >> >         /* Race condition: route has just become dead. */
> >> >         res->nh_sel = 0;
> >> > -       spin_unlock_bh(&fib_multipath_lock);
> >> >  }
> >> >  #endif
> >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >> > index 80f7c5b..3dc83b6 100644
> >> > --- a/net/ipv4/route.c
> >> > +++ b/net/ipv4/route.c
> >> > @@ -1659,8 +1659,12 @@ static int ip_mkroute_input(struct sk_buff
> >> > *skb, __be32 daddr, __be32 saddr, u32 tos)
> >> >  {
> >> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> > -       if (res->fi && res->fi->fib_nhs > 1)
> >> > -               fib_select_multipath(res);
> >> > +       if (res->fi && res->fi->fib_nhs > 1) {
> >> > +               int h;
> >> > +
> >> > +               h = fib_multipath_hash(saddr, daddr);
> >> > +               fib_select_multipath(res, h);
> >> > +       }
> >> >  #endif
> >> >
> >> >         /* create a routing cache entry */
> >> > @@ -2190,8 +2194,12 @@ struct rtable
> >> > *__ip_route_output_key(struct net *net, struct flowi4 *fl4) }
> >> >
> >> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >> > -       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
> >> > -               fib_select_multipath(&res);
> >> > +       if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
> >> > +               int h;
> >> > +
> >> > +               h = fib_multipath_hash(fl4->saddr, fl4->daddr);
> >> > +               fib_select_multipath(&res, h);
> >> > +       }
> >> >         else
> >> >  #endif
> >> >         if (!res.prefixlen &&
> >> > --
> >> > 1.7.10.4
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe
> >> > netdev" in the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at
> >> > http://vger.kernel.org/majordomo-info.html
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing
  2015-09-23 19:49 [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing Peter Nørlund
  2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund
  2015-09-23 19:49 ` [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath Peter Nørlund
@ 2015-09-29  2:33 ` David Miller
  2015-09-29  2:55   ` David Miller
  2 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2015-09-29  2:33 UTC (permalink / raw)
  To: pch; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber

From: Peter Nørlund <pch@ordbogen.com>
Date: Wed, 23 Sep 2015 21:49:35 +0200

> When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
> from more or less being destination-based into being quasi-random per-packet
> scheduling. This increases the risk of out-of-order packets and makes it
> impossible to use multipath together with anycast services.
> 
> This patch series replaces the old implementation with flow-based load
> balancing based on a hash over the source and destination addresses.

This isn't perfect but it's a significant step in the right direction.
So I'm going to apply this to net-next now and we can make incremental
improvements upon it.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing
  2015-09-29  2:33 ` [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing David Miller
@ 2015-09-29  2:55   ` David Miller
  2015-09-29 11:29     ` Peter Nørlund
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2015-09-29  2:55 UTC (permalink / raw)
  To: pch; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber

From: David Miller <davem@davemloft.net>
Date: Mon, 28 Sep 2015 19:33:55 -0700 (PDT)

> From: Peter Nørlund <pch@ordbogen.com>
> Date: Wed, 23 Sep 2015 21:49:35 +0200
> 
>> When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
>> from more or less being destination-based into being quasi-random per-packet
>> scheduling. This increases the risk of out-of-order packets and makes it
>> impossible to use multipath together with anycast services.
>> 
>> This patch series replaces the old implementation with flow-based load
>> balancing based on a hash over the source and destination addresses.
> 
> This isn't perfect but it's a significant step in the right direction.
> So I'm going to apply this to net-next now and we can make incremental
> improvements upon it.

Actually, I had to revert, this doesn't build:

[davem@localhost net-next]$ make -s -j8
Setup is 16876 bytes (padded to 16896 bytes).
System is 10011 kB
CRC 324f2811
Kernel: arch/x86/boot/bzImage is ready  (#337)
ERROR: "__ip_route_output_key_hash" [net/dccp/dccp_ipv4.ko] undefined!
scripts/Makefile.modpost:90: recipe for target '__modpost' failed
make[1]: *** [__modpost] Error 1
Makefile:1095: recipe for target 'modules' failed
make: *** [modules] Error 2

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing
  2015-09-29  2:55   ` David Miller
@ 2015-09-29 11:29     ` Peter Nørlund
  2015-09-29 12:38       ` David Laight
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Nørlund @ 2015-09-29 11:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber

On Mon, 28 Sep 2015 19:55:41 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: David Miller <davem@davemloft.net>
> Date: Mon, 28 Sep 2015 19:33:55 -0700 (PDT)
> 
> > From: Peter Nørlund <pch@ordbogen.com>
> > Date: Wed, 23 Sep 2015 21:49:35 +0200
> > 
> >> When the routing cache was removed in 3.6, the IPv4 multipath
> >> algorithm changed from more or less being destination-based into
> >> being quasi-random per-packet scheduling. This increases the risk
> >> of out-of-order packets and makes it impossible to use multipath
> >> together with anycast services.
> >> 
> >> This patch series replaces the old implementation with flow-based
> >> load balancing based on a hash over the source and destination
> >> addresses.
> > 
> > This isn't perfect but it's a significant step in the right
> > direction. So I'm going to apply this to net-next now and we can
> > make incremental improvements upon it.
> 
> Actually, I had to revert, this doesn't build:
> 
> [davem@localhost net-next]$ make -s -j8
> Setup is 16876 bytes (padded to 16896 bytes).
> System is 10011 kB
> CRC 324f2811
> Kernel: arch/x86/boot/bzImage is ready  (#337)
> ERROR: "__ip_route_output_key_hash" [net/dccp/dccp_ipv4.ko] undefined!
> scripts/Makefile.modpost:90: recipe for target '__modpost' failed
> make[1]: *** [__modpost] Error 1
> Makefile:1095: recipe for target 'modules' failed
> make: *** [modules] Error 2

Sorry! I forgot to update the EXPORT_SYMBOL_GPL line.

In the meantime I've been doing some thinking (and measuring).
Considering that the broader goal is to make IPv6 and IPv4 behave as
identically as possible, it is probably not such a bad idea to just use
the flow dissector + modulo in the IPv4 code too - the patch will be
simpler than the current one.

I fear the performance impact of the flow dissector though - some of my
earlier measurements showed that it was 5-6 times slower than the
simple one I used. But maybe it is better to streamline the IPv4/IPv6
multipath first and then improve upon it afterward (make it work, make
it right, make it fast).
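
Roughly, the "flow dissector + modulo" variant would boil down to
something like this in ip_mkroute_input() (a sketch only - weights and
dead nexthops are ignored here, and whether the hash should come from
skb_get_hash() or a dedicated dissector call is exactly the open
question):

#ifdef CONFIG_IP_ROUTE_MULTIPATH
	if (res->fi && res->fi->fib_nhs > 1) {
		u32 h = skb_get_hash(skb);	/* L4-aware flow hash */

		/* plain modulo, roughly what the IPv6 side does today */
		res->nh_sel = h % res->fi->fib_nhs;
	}
#endif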

As for using L4 hashing with anycast, CloudFlare apparently does L4
hashing - they could have disabled it, but they didn't. Besides,
analysis of my own load balancers showed that only one in every
500,000,000 packets is fragmented. And even if I hit a fragmented
packet, it is only a problem if the packet hits the wrong load
balancer, and if that load balancer hasn't been updated with the state
from another load balancer (that is, one of the very first packets). It
is still a possible scenario though - especially with large HTTP
cookies or file uploads. But apparently it is a common problem that IP
fragments get dropped on the Internet, so I suspect that ECMP+Anycast
sites are just part of the pool of problematic sites for people with
fragments.

I'm still unsettled as to whether the ICMP handling belongs in the
kernel or not. The above breakage was in the ICMP part of the
patchset, so judging from that, I guess it wasn't out of the question.
But in the "IPv4 and IPv6 should behave identically" mindset, it probably
belongs in a separate, future patchset, adding ICMP handling to both
IPv4 and IPv6 - and it is actually more important for IPv6 than for IPv4,
since PMTUD cannot be disabled there.

Best Regards,
  Peter Nørlund

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing
  2015-09-29 11:29     ` Peter Nørlund
@ 2015-09-29 12:38       ` David Laight
  0 siblings, 0 replies; 16+ messages in thread
From: David Laight @ 2015-09-29 12:38 UTC (permalink / raw)
  To: 'Peter Nørlund', David Miller
  Cc: netdev@vger.kernel.org, kuznet@ms2.inr.ac.ru, jmorris@namei.org,
	yoshfuji@linux-ipv6.org, kaber@trash.net

From: Peter Nørlund
> Sent: 29 September 2015 12:29
...
> As for using L4 hashing with anycast, CloudFlare apparently does L4
> hashing - they could have disabled it, but they didn't. Besides,
> analysis of my own load balancers showed that only one in every
> 500,000,000 packets is fragmented. And even if I hit a fragmented
> packet, it is only a problem if the packet hits the wrong load
> balancer, and if that load balancer haven't been updated with the state
> from another load balancer (that is, one of the very first packets). It
> is still a possible scenario though - especially with large HTTP
> cookies or file uploads. But apparently it is a common problem that IP
> fragments gets dropped on the Internet, so I suspect that ECMP+Anycast
> sites are just part of the pool of problematic sites for people with
> fragments.

Fragmentation is usually more of an issue with UDP than TCP.
Some SIP messages can get fragmented...

	David


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-09-29 12:40 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-23 19:49 [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing Peter Nørlund
2015-09-23 19:49 ` [PATCH v4 net-next 1/2] ipv4: L3 hash-based multipath Peter Nørlund
2015-09-23 21:00   ` Tom Herbert
2015-09-23 21:43     ` Peter Nørlund
2015-09-23 23:12       ` Tom Herbert
2015-09-23 23:38         ` Peter Nørlund
2015-09-23 23:09     ` Peter Nørlund
2015-09-23 23:37       ` Tom Herbert
2015-09-23 21:55   ` David Miller
2015-09-23 19:49 ` [PATCH v4 net-next 2/2] ipv4: ICMP packet inspection for multipath Peter Nørlund
2015-09-23 21:01   ` Tom Herbert
2015-09-23 22:07     ` Peter Nørlund
2015-09-29  2:33 ` [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing David Miller
2015-09-29  2:55   ` David Miller
2015-09-29 11:29     ` Peter Nørlund
2015-09-29 12:38       ` David Laight
