* [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
@ 2015-08-28 20:00 pch
2015-08-28 20:00 ` [PATCH v2 net-next 3/3] ipv4: ICMP packet inspection for L3 multipath pch
[not found] ` <1440792050-2109-1-git-send-email-pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
0 siblings, 2 replies; 13+ messages in thread
From: pch @ 2015-08-28 20:00 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, linux-api, Roopa Prabhu,
Scott Feldman, Eric W. Biederman, Nicolas Dichtel, Thomas Graf,
Jiri Benc, Peter Nørlund
When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
from being more or less destination-based into quasi-random per-packet
scheduling. This increases the risk of out-of-order packets and makes it
impossible to use multipath together with anycast services.
This patch series seeks to extend the multipath system to support both L3 and
L4-based multipath while still supporting per-packet multipath.
The multipath algorithm is set as a per-route attribute (RTA_MP_ALGO) with some
degree of binary compatibility with the old implementation (2.6.12 - 2.6.22),
but without source-level compatibility, since the attributes have different names:
RT_MP_ALG_L3_HASH:
L3 hash-based distribution. This was IP_MP_ALG_NONE, which with the routing
cache behaved somewhat like L3-based distribution. This is the new default.
RT_MP_ALG_PER_PACKET:
Per-packet distribution. Was IP_MP_ALG_RR. Uses round-robin.
RT_MP_ALG_DRR, RT_MP_ALG_RANDOM, RT_MP_ALG_WRANDOM:
Unsupported values, but reserved because they existed in 2.6.12 - 2.6.22.
RT_MP_ALG_L4_HASH:
L4 hash-based distribution. This is new.
The traditional modulo approach is replaced by a threshold-based approach,
described in RFC 2992. This reduces disruption in case of link failures or
route changes.
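The hash-threshold scheme can be illustrated with a small userspace sketch (plain C, not the kernel code from this series; the weight handling is simplified, and the 31-bit hash space with inclusive upper bounds mirrors what the patches use):

```c
#include <stdint.h>

#define HASH_SPACE ((int64_t)1 << 31)	/* 31-bit hash space, as in this series */

/* Compute inclusive upper bounds from weights: nexthop i owns the
 * hash range (bound[i-1], bound[i]]. Dead paths would get bound -1
 * in the kernel code; this sketch assumes all paths are alive.
 */
static void compute_bounds(const int *weight, int npaths, int32_t *bound)
{
	int64_t total = 0, w = 0;
	int i;

	for (i = 0; i < npaths; i++)
		total += weight[i];

	for (i = 0; i < npaths; i++) {
		w += weight[i];
		/* round-to-closest division, like DIV_ROUND_CLOSEST */
		bound[i] = (int32_t)((HASH_SPACE * w + total / 2) / total - 1);
	}
}

/* Select the first nexthop whose upper bound covers the hash */
static int select_path(int32_t hash, const int32_t *bound, int npaths)
{
	int i;

	for (i = 0; i < npaths; i++)
		if (hash <= bound[i])
			return i;
	return 0;	/* all paths dead: fall back to path 0 */
}
```

With hash-threshold, removing one of N paths only remaps the flows whose hashes sit near the shifted thresholds, whereas modulo-N remaps roughly (N-1)/N of all flows (see RFC 2992 for the analysis).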
To better support anycast environments, where PMTU discovery usually breaks with
multipath, certain ICMP packets are hashed using the IP addresses within the
ICMP payload when L3 hashing is used. This ensures that ICMP packets are routed
over the same path as the flow they belong to. It is not enabled with L4
hashing, since L4 information can only be relied on consistently when PMTU
discovery is used, and PMTU discovery may be used in one direction but not the other.
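The ICMP trick can be sketched as follows (illustrative userspace C with simplified stand-in types, not the code from patch 3/3): an ICMP error quotes the packet that triggered it, so hashing the quoted inner addresses with source and destination swapped reproduces the forward flow's key.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's flow key */
struct flow_key {
	uint32_t saddr;
	uint32_t daddr;
};

/* Key for a regular forward packet */
static void forward_flow_key(uint32_t saddr, uint32_t daddr,
			     struct flow_key *key)
{
	key->saddr = saddr;
	key->daddr = daddr;
}

/* An ICMP error quotes the packet that triggered it, i.e. a packet
 * sent by the host the error is now travelling toward. Hashing the
 * quoted (inner) addresses swapped therefore yields the same key as
 * the forward flow, so the error follows the same multipath route.
 */
static void icmp_error_flow_key(uint32_t inner_saddr, uint32_t inner_daddr,
				struct flow_key *key)
{
	key->saddr = inner_daddr;
	key->daddr = inner_saddr;
}
```

For example, a fragmentation-needed error aimed at an anycast address quotes an anycast-to-client packet; swapping the inner addresses gives the client-to-anycast key, so the error reaches the same server as the client's flow.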
As a side effect, the multipath spinlock was removed and the code got faster.
I measured ip_mkroute_input (excl. __mkroute_input) on a Xeon X3350 (4 cores,
2.66GHz) with two paths:
Old per-packet: ~393.9 cycles(tsc)
New per-packet: ~75.2 cycles(tsc)
New L3: ~107.9 cycles(tsc)
New L4: ~129.1 cycles(tsc)
The timings are approximately the same with a single core, except for the old
per-packet scheme, which gets faster (~199.8 cycles), most likely because there
is no contention on the spinlock.
If this patch is accepted, a follow-up patch to iproute2 will also be
submitted.
Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecessary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown
Best Regards,
Peter Nørlund
Peter Nørlund (3):
ipv4: Lock-less per-packet multipath
ipv4: L3 and L4 hash-based multipath routing
ipv4: ICMP packet inspection for L3 multipath
include/net/ip_fib.h | 26 ++++++-
include/net/route.h | 12 ++-
include/uapi/linux/rtnetlink.h | 14 +++-
net/ipv4/Kconfig | 1 +
net/ipv4/fib_frontend.c | 4 +
net/ipv4/fib_semantics.c | 168 ++++++++++++++++++++++++++---------------
net/ipv4/icmp.c | 34 ++++++++-
net/ipv4/route.c | 112 +++++++++++++++++++++++++--
8 files changed, 298 insertions(+), 73 deletions(-)
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 net-next 1/3] ipv4: Lock-less per-packet multipath
[not found] ` <1440792050-2109-1-git-send-email-pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
@ 2015-08-28 20:00 ` pch-chEQUL3jiZBWk0Htik3J/w
2015-08-28 20:00 ` [PATCH v2 net-next 2/3] ipv4: L3 and L4 hash-based multipath routing pch-chEQUL3jiZBWk0Htik3J/w
2015-08-29 20:14 ` [PATCH v2 net-next 0/3] ipv4: Hash-based " David Miller
2 siblings, 0 replies; 13+ messages in thread
From: pch-chEQUL3jiZBWk0Htik3J/w @ 2015-08-28 20:00 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA
Cc: David S. Miller, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy,
linux-api-u79uwXL29TY76Z2rM5mHXA, Roopa Prabhu, Scott Feldman,
Eric W. Biederman, Nicolas Dichtel, Thomas Graf, Jiri Benc,
Peter Nørlund
From: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
The current multipath code attempted to be quasi-random, but in most cases it
behaved just like round-robin balancing. This patch refactors the
algorithm to be exactly that and, in doing so, avoids the spin lock.
The new design paves the way for hash-based multipath, replacing the modulo
approach with thresholds and minimizing disruption in case of failing paths or
route replacements.
Signed-off-by: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
---
include/net/ip_fib.h | 4 +-
net/ipv4/Kconfig | 1 +
net/ipv4/fib_semantics.c | 129 ++++++++++++++++++++++++++---------------------
3 files changed, 74 insertions(+), 60 deletions(-)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index a37d043..18a3c7f 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -79,7 +79,7 @@ struct fib_nh {
unsigned char nh_scope;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
int nh_weight;
- int nh_power;
+ atomic_t nh_upper_bound;
#endif
#ifdef CONFIG_IP_ROUTE_CLASSID
__u32 nh_tclassid;
@@ -118,7 +118,7 @@ struct fib_info {
#define fib_advmss fib_metrics[RTAX_ADVMSS-1]
int fib_nhs;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
- int fib_power;
+ int fib_weight;
#endif
struct rcu_head rcu;
struct fib_nh fib_nh[0];
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 416dfa0..23d8b2b 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -81,6 +81,7 @@ config IP_MULTIPLE_TABLES
config IP_ROUTE_MULTIPATH
bool "IP: equal cost multipath"
depends on IP_ADVANCED_ROUTER
+ select BITREVERSE
help
Normally, the routing tables specify a single action to be taken in
a deterministic manner for a given packet. If you say Y here
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 1b2d011..becb63f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -15,6 +15,7 @@
#include <asm/uaccess.h>
#include <linux/bitops.h>
+#include <linux/bitrev.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/jiffies.h>
@@ -58,7 +59,7 @@ static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
#ifdef CONFIG_IP_ROUTE_MULTIPATH
-static DEFINE_SPINLOCK(fib_multipath_lock);
+static DEFINE_PER_CPU(int, fib_multipath_rr_counter);
#define for_nexthops(fi) { \
int nhsel; const struct fib_nh *nh; \
@@ -468,6 +469,52 @@ static int fib_count_nexthops(struct rtnexthop *rtnh, int remaining)
return remaining > 0 ? 0 : nhs;
}
+static void fib_rebalance(struct fib_info *fi)
+{
+ int total;
+ int w;
+ struct in_device *in_dev;
+
+ if (fi->fib_nhs < 2)
+ return;
+
+ total = 0;
+ for_nexthops(fi) {
+ if (nh->nh_flags & RTNH_F_DEAD)
+ continue;
+
+ in_dev = __in_dev_get_rcu(nh->nh_dev);
+
+ if (in_dev &&
+ IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+ nh->nh_flags & RTNH_F_LINKDOWN)
+ continue;
+
+ total += nh->nh_weight;
+ } endfor_nexthops(fi);
+
+ w = 0;
+ change_nexthops(fi) {
+ int upper_bound;
+
+ in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
+
+ if (nexthop_nh->nh_flags & RTNH_F_DEAD) {
+ upper_bound = -1;
+ } else if (in_dev &&
+ IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
+ nexthop_nh->nh_flags & RTNH_F_LINKDOWN) {
+ upper_bound = -1;
+ } else {
+ w += nexthop_nh->nh_weight;
+ upper_bound = DIV_ROUND_CLOSEST(2147483648LL * w,
+ total) - 1;
+ }
+
+ atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
+ } endfor_nexthops(fi);
+}
+
static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh,
int remaining, struct fib_config *cfg)
{
@@ -1077,8 +1124,15 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
change_nexthops(fi) {
fib_info_update_nh_saddr(net, nexthop_nh);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ fi->fib_weight += nexthop_nh->nh_weight;
+#endif
} endfor_nexthops(fi)
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ fib_rebalance(fi);
+#endif
+
link_it:
ofi = fib_find_info(fi);
if (ofi) {
@@ -1300,12 +1354,6 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
nexthop_nh->nh_flags |= RTNH_F_LINKDOWN;
break;
}
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
- spin_lock_bh(&fib_multipath_lock);
- fi->fib_power -= nexthop_nh->nh_power;
- nexthop_nh->nh_power = 0;
- spin_unlock_bh(&fib_multipath_lock);
-#endif
dead++;
}
#ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1328,6 +1376,10 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
}
ret++;
}
+
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ fib_rebalance(fi);
+#endif
}
return ret;
@@ -1450,20 +1502,17 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
!__in_dev_get_rtnl(dev))
continue;
alive++;
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
- spin_lock_bh(&fib_multipath_lock);
- nexthop_nh->nh_power = 0;
- nexthop_nh->nh_flags &= ~nh_flags;
- spin_unlock_bh(&fib_multipath_lock);
-#else
nexthop_nh->nh_flags &= ~nh_flags;
-#endif
} endfor_nexthops(fi)
if (alive > 0) {
fi->fib_flags &= ~nh_flags;
ret++;
}
+
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ fib_rebalance(fi);
+#endif
}
return ret;
@@ -1478,55 +1527,19 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
void fib_select_multipath(struct fib_result *res)
{
struct fib_info *fi = res->fi;
- struct in_device *in_dev;
- int w;
-
- spin_lock_bh(&fib_multipath_lock);
- if (fi->fib_power <= 0) {
- int power = 0;
- change_nexthops(fi) {
- in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
- if (nexthop_nh->nh_flags & RTNH_F_DEAD)
- continue;
- if (in_dev &&
- IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
- nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
- continue;
- power += nexthop_nh->nh_weight;
- nexthop_nh->nh_power = nexthop_nh->nh_weight;
- } endfor_nexthops(fi);
- fi->fib_power = power;
- if (power <= 0) {
- spin_unlock_bh(&fib_multipath_lock);
- /* Race condition: route has just become dead. */
- res->nh_sel = 0;
- return;
- }
- }
+ int h;
+ h = bitrev32(this_cpu_inc_return(fib_multipath_rr_counter)) >> 1;
- /* w should be random number [0..fi->fib_power-1],
- * it is pretty bad approximation.
- */
-
- w = jiffies % fi->fib_power;
+ for_nexthops(fi) {
+ if (h > atomic_read(&nh->nh_upper_bound))
+ continue;
- change_nexthops(fi) {
- if (!(nexthop_nh->nh_flags & RTNH_F_DEAD) &&
- nexthop_nh->nh_power) {
- w -= nexthop_nh->nh_power;
- if (w <= 0) {
- nexthop_nh->nh_power--;
- fi->fib_power--;
- res->nh_sel = nhsel;
- spin_unlock_bh(&fib_multipath_lock);
- return;
- }
- }
- } endfor_nexthops(fi);
+ res->nh_sel = nhsel;
+ return;
+ } endfor_nexthops(fi)
/* Race condition: route has just become dead. */
res->nh_sel = 0;
- spin_unlock_bh(&fib_multipath_lock);
}
#endif
--
2.1.4
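The per-packet path in fib_select_multipath() above replaces the jiffies-based selection with a bit-reversed per-CPU counter: reversing the bits of an incrementing counter walks the 31-bit hash space in a maximally spread order, so consecutive packets alternate between the threshold regions instead of dwelling in one. A rough userspace illustration (the hand-rolled reversal below stands in for the kernel's bitrev32(); the counter starts at 1 because this_cpu_inc_return() returns the incremented value):

```c
#include <stdint.h>

/* Reverse the bits of a 32-bit word (stand-in for bitrev32()) */
static uint32_t bitrev32_sketch(uint32_t x)
{
	x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
	x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
	x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
	x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
	return (x << 16) | (x >> 16);
}

/* Per-packet "hash": successive counter values land alternately in
 * the lower and upper halves of the 31-bit space (then quarters,
 * eighths, ...), so with two equal-weight paths traffic effectively
 * round-robins between them without any lock.
 */
static int32_t perpacket_hash(uint32_t counter)
{
	return (int32_t)(bitrev32_sketch(counter) >> 1);
}
```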
* [PATCH v2 net-next 2/3] ipv4: L3 and L4 hash-based multipath routing
[not found] ` <1440792050-2109-1-git-send-email-pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
2015-08-28 20:00 ` [PATCH v2 net-next 1/3] ipv4: Lock-less per-packet multipath pch-chEQUL3jiZBWk0Htik3J/w
@ 2015-08-28 20:00 ` pch-chEQUL3jiZBWk0Htik3J/w
2015-08-30 22:48 ` Tom Herbert
2015-08-29 20:14 ` [PATCH v2 net-next 0/3] ipv4: Hash-based " David Miller
2 siblings, 1 reply; 13+ messages in thread
From: pch-chEQUL3jiZBWk0Htik3J/w @ 2015-08-28 20:00 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA
Cc: David S. Miller, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy,
linux-api-u79uwXL29TY76Z2rM5mHXA, Roopa Prabhu, Scott Feldman,
Eric W. Biederman, Nicolas Dichtel, Thomas Graf, Jiri Benc,
Peter Nørlund
From: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
This patch adds L3 and L4 hash-based multipath routing, selectable on a
per-route basis with the reintroduced RTA_MP_ALGO attribute. The default is
now RT_MP_ALG_L3_HASH.
Signed-off-by: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
---
include/net/ip_fib.h | 22 ++++++++++++++++-
include/uapi/linux/rtnetlink.h | 14 ++++++++++-
net/ipv4/fib_frontend.c | 4 +++
net/ipv4/fib_semantics.c | 43 +++++++++++++++++++++++++++-----
net/ipv4/route.c | 56 ++++++++++++++++++++++++++++++++++++++++--
5 files changed, 129 insertions(+), 10 deletions(-)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 18a3c7f..21e74b5 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -37,6 +37,7 @@ struct fib_config {
u32 fc_flags;
u32 fc_priority;
__be32 fc_prefsrc;
+ int fc_mp_alg;
struct nlattr *fc_mx;
struct rtnexthop *fc_mp;
int fc_mx_len;
@@ -119,6 +120,7 @@ struct fib_info {
int fib_nhs;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
int fib_weight;
+ int fib_mp_alg;
#endif
struct rcu_head rcu;
struct fib_nh fib_nh[0];
@@ -312,7 +314,25 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
int fib_sync_down_dev(struct net_device *dev, unsigned long event);
int fib_sync_down_addr(struct net *net, __be32 local);
int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
-void fib_select_multipath(struct fib_result *res);
+
+struct multipath_flow4 {
+ __be32 saddr;
+ __be32 daddr;
+ union {
+ __be32 ports;
+ struct {
+ __be16 sport;
+ __be16 dport;
+ };
+ };
+};
+
+typedef void (*multipath_flow4_func_t)(struct multipath_flow4 *flow,
+ void *ctx);
+
+void fib_select_multipath(struct fib_result *res,
+ multipath_flow4_func_t flow_func,
+ void *ctx);
/* Exported by fib_trie.c */
void fib_trie_init(void);
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 0d3d3cc..2563a96 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -271,6 +271,18 @@ enum rt_scope_t {
#define RTM_F_EQUALIZE 0x400 /* Multipath equalizer: NI */
#define RTM_F_PREFIX 0x800 /* Prefix addresses */
+/* Multipath algorithms */
+
+enum rt_mp_alg_t {
+ RT_MP_ALG_L3_HASH, /* Was IP_MP_ALG_NONE */
+ RT_MP_ALG_PER_PACKET, /* Was IP_MP_ALG_RR */
+ RT_MP_ALG_DRR, /* not used */
+ RT_MP_ALG_RANDOM, /* not used */
+ RT_MP_ALG_WRANDOM, /* not used */
+ RT_MP_ALG_L4_HASH,
+ __RT_MP_ALG_MAX
+};
+
/* Reserved table identifiers */
enum rt_class_t {
@@ -301,7 +313,7 @@ enum rtattr_type_t {
RTA_FLOW,
RTA_CACHEINFO,
RTA_SESSION, /* no longer used */
- RTA_MP_ALGO, /* no longer used */
+ RTA_MP_ALGO,
RTA_TABLE,
RTA_MARK,
RTA_MFC_STATS,
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 7fa2771..5ba4442 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -622,6 +622,7 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = {
[RTA_PREFSRC] = { .type = NLA_U32 },
[RTA_METRICS] = { .type = NLA_NESTED },
[RTA_MULTIPATH] = { .len = sizeof(struct rtnexthop) },
+ [RTA_MP_ALGO] = { .type = NLA_U32 },
[RTA_FLOW] = { .type = NLA_U32 },
[RTA_ENCAP_TYPE] = { .type = NLA_U16 },
[RTA_ENCAP] = { .type = NLA_NESTED },
@@ -684,6 +685,9 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
cfg->fc_mp = nla_data(attr);
cfg->fc_mp_len = nla_len(attr);
break;
+ case RTA_MP_ALGO:
+ cfg->fc_mp_alg = nla_get_u32(attr);
+ break;
case RTA_FLOW:
cfg->fc_flow = nla_get_u32(attr);
break;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index becb63f..3a80b1a 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -259,6 +259,11 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
{
const struct fib_nh *onh = ofi->fib_nh;
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fi->fib_mp_alg != ofi->fib_mp_alg)
+ return -1;
+#endif
+
for_nexthops(fi) {
if (nh->nh_oif != onh->nh_oif ||
nh->nh_gw != onh->nh_gw ||
@@ -1028,6 +1033,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (cfg->fc_mp) {
#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ fi->fib_mp_alg = cfg->fc_mp_alg;
err = fib_get_nhs(fi, cfg->fc_mp, cfg->fc_mp_len, cfg);
if (err != 0)
goto failure;
@@ -1245,6 +1251,10 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
struct rtnexthop *rtnh;
struct nlattr *mp;
+ if (fi->fib_mp_alg &&
+ nla_put_u32(skb, RTA_MP_ALGO, fi->fib_mp_alg))
+ goto nla_put_failure;
+
mp = nla_nest_start(skb, RTA_MULTIPATH);
if (!mp)
goto nla_put_failure;
@@ -1520,16 +1530,37 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
#ifdef CONFIG_IP_ROUTE_MULTIPATH
-/*
- * The algorithm is suboptimal, but it provides really
- * fair weighted route distribution.
- */
-void fib_select_multipath(struct fib_result *res)
+/* Compute multipath hash based on 3- or 5-tuple */
+static int fib_multipath_hash(const struct fib_result *res,
+ multipath_flow4_func_t flow_func, void *ctx)
+{
+ struct multipath_flow4 flow;
+
+ flow_func(&flow, ctx);
+
+ if (res->fi->fib_mp_alg == RT_MP_ALG_L4_HASH)
+ return jhash_3words(flow.saddr, flow.daddr, flow.ports, 0) >> 1;
+ else
+ return jhash_2words(flow.saddr, flow.daddr, 0) >> 1;
+}
+
+static int fib_multipath_perpacket(void)
+{
+ return bitrev32(this_cpu_inc_return(fib_multipath_rr_counter)) >> 1;
+}
+
+void fib_select_multipath(struct fib_result *res,
+ multipath_flow4_func_t flow_func,
+ void *ctx)
{
struct fib_info *fi = res->fi;
int h;
- h = bitrev32(this_cpu_inc_return(fib_multipath_rr_counter)) >> 1;
+ if (res->fi->fib_mp_alg == RT_MP_ALG_PER_PACKET) {
+ h = fib_multipath_perpacket();
+ } else {
+ h = fib_multipath_hash(res, flow_func, ctx);
+ }
for_nexthops(fi) {
if (h > atomic_read(&nh->nh_upper_bound))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f3087aa..f50f84f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1643,6 +1643,58 @@ out:
return err;
}
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+
+/* Fill multipath flow key data based on socket buffer */
+static void ip_multipath_flow_skb(struct multipath_flow4 *flow, void *ctx)
+{
+ const struct sk_buff *skb = (const struct sk_buff *)ctx;
+ const struct iphdr *iph;
+
+ iph = ip_hdr(skb);
+
+ flow->saddr = iph->saddr;
+ flow->daddr = iph->daddr;
+ flow->ports = 0;
+
+ if (unlikely(!(iph->frag_off & htons(IP_DF))))
+ return;
+
+ if (iph->protocol == IPPROTO_TCP ||
+ iph->protocol == IPPROTO_UDP ||
+ iph->protocol == IPPROTO_SCTP) {
+ __be16 _ports[2];
+ const __be16 *ports;
+
+ ports = skb_header_pointer(skb, iph->ihl * 4, sizeof(_ports),
+ &_ports);
+ if (ports) {
+ flow->sport = ports[0];
+ flow->dport = ports[1];
+ }
+ }
+}
+
+/* Fill multipath flow key data based on flowi4 */
+static void ip_multipath_flow_fl4(struct multipath_flow4 *flow, void *ctx)
+{
+ const struct flowi4 *fl4 = (const struct flowi4 *)ctx;
+
+ flow->saddr = fl4->saddr;
+ flow->daddr = fl4->daddr;
+
+ if (fl4->flowi4_proto == IPPROTO_TCP ||
+ fl4->flowi4_proto == IPPROTO_UDP ||
+ fl4->flowi4_proto == IPPROTO_SCTP) {
+ flow->sport = fl4->fl4_sport;
+ flow->dport = fl4->fl4_dport;
+ } else {
+ flow->ports = 0;
+ }
+}
+
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
+
static int ip_mkroute_input(struct sk_buff *skb,
struct fib_result *res,
const struct flowi4 *fl4,
@@ -1651,7 +1703,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
{
#ifdef CONFIG_IP_ROUTE_MULTIPATH
if (res->fi && res->fi->fib_nhs > 1)
- fib_select_multipath(res);
+ fib_select_multipath(res, ip_multipath_flow_skb, skb);
#endif
/* create a routing cache entry */
@@ -2197,7 +2249,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
#ifdef CONFIG_IP_ROUTE_MULTIPATH
if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
- fib_select_multipath(&res);
+ fib_select_multipath(&res, ip_multipath_flow_fl4, fl4);
else
#endif
if (!res.prefixlen &&
--
2.1.4
* [PATCH v2 net-next 3/3] ipv4: ICMP packet inspection for L3 multipath
2015-08-28 20:00 [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing pch
@ 2015-08-28 20:00 ` pch
[not found] ` <1440792050-2109-1-git-send-email-pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
1 sibling, 0 replies; 13+ messages in thread
From: pch @ 2015-08-28 20:00 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, linux-api, Roopa Prabhu,
Scott Feldman, Eric W. Biederman, Nicolas Dichtel, Thomas Graf,
Jiri Benc, Peter Nørlund
From: Peter Nørlund <pch@ordbogen.com>
When doing L3-based multipath, ICMP packets are inspected so that they are
routed over the same path as the flow they relate to, allowing anycast
environments to work with ECMP.
Signed-off-by: Peter Nørlund <pch@ordbogen.com>
---
include/net/ip_fib.h | 2 +-
include/net/route.h | 12 ++++++-
net/ipv4/fib_semantics.c | 2 +-
net/ipv4/icmp.c | 34 +++++++++++++++++++-
net/ipv4/route.c | 82 ++++++++++++++++++++++++++++++++++++++----------
5 files changed, 112 insertions(+), 20 deletions(-)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 21e74b5..3e5d4ed 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -328,7 +328,7 @@ struct multipath_flow4 {
};
typedef void (*multipath_flow4_func_t)(struct multipath_flow4 *flow,
- void *ctx);
+ enum rt_mp_alg_t algo, void *ctx);
void fib_select_multipath(struct fib_result *res,
multipath_flow4_func_t flow_func,
diff --git a/include/net/route.h b/include/net/route.h
index 395d79b..ccb85fc 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -28,6 +28,7 @@
#include <net/inetpeer.h>
#include <net/flow.h>
#include <net/inet_sock.h>
+#include <net/ip_fib.h>
#include <linux/in_route.h>
#include <linux/rtnetlink.h>
#include <linux/rcupdate.h>
@@ -110,7 +111,16 @@ struct in_device;
int ip_rt_init(void);
void rt_cache_flush(struct net *net);
void rt_flush_dev(struct net_device *dev);
-struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
+struct rtable *__ip_route_output_key_flow(struct net *, struct flowi4 *flp,
+ multipath_flow4_func_t flow_func,
+ void *ctx);
+
+static inline struct rtable *__ip_route_output_key(struct net *net,
+ struct flowi4 *flp)
+{
+ return __ip_route_output_key_flow(net, flp, NULL, NULL);
+}
+
struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
struct sock *sk);
struct dst_entry *ipv4_blackhole_route(struct net *net,
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3a80b1a..000c535 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1536,7 +1536,7 @@ static int fib_multipath_hash(const struct fib_result *res,
{
struct multipath_flow4 flow;
- flow_func(&flow, ctx);
+ flow_func(&flow, res->fi->fib_mp_alg, ctx);
if (res->fi->fib_mp_alg == RT_MP_ALG_L4_HASH)
return jhash_3words(flow.saddr, flow.daddr, flow.ports, 0) >> 1;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f16488e..0e25fe4 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -439,6 +439,38 @@ out_unlock:
icmp_xmit_unlock(sk);
}
+/* Source and destination are swapped. See ip_multipath_flow_skb */
+static void icmp_multipath_flow(struct multipath_flow4 *flow,
+ enum rt_mp_alg_t algo, void *ctx)
+{
+ const struct sk_buff *skb = (const struct sk_buff *)ctx;
+ const struct iphdr *iph = ip_hdr(skb);
+
+ flow->saddr = iph->daddr;
+ flow->daddr = iph->saddr;
+ flow->ports = 0;
+
+ if (algo == RT_MP_ALG_L4_HASH)
+ return;
+
+ if (unlikely(!(iph->frag_off & htons(IP_DF))))
+ return;
+
+ if (iph->protocol == IPPROTO_TCP ||
+ iph->protocol == IPPROTO_UDP ||
+ iph->protocol == IPPROTO_SCTP) {
+ __be16 _ports[2];
+ const __be16 *ports;
+
+ ports = skb_header_pointer(skb, iph->ihl * 4, sizeof(_ports),
+ &_ports);
+ if (ports) {
+ flow->sport = ports[1];
+ flow->dport = ports[0];
+ }
+ }
+}
+
static struct rtable *icmp_route_lookup(struct net *net,
struct flowi4 *fl4,
struct sk_buff *skb_in,
@@ -463,7 +495,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
fl4->flowi4_oif = vrf_master_ifindex(skb_in->dev) ? : skb_in->dev->ifindex;
security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
- rt = __ip_route_output_key(net, fl4);
+ rt = __ip_route_output_key_flow(net, fl4, icmp_multipath_flow, skb_in);
if (IS_ERR(rt))
return rt;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f50f84f..edbeb56 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1646,37 +1646,82 @@ out:
#ifdef CONFIG_IP_ROUTE_MULTIPATH
/* Fill multipath flow key data based on socket buffer */
-static void ip_multipath_flow_skb(struct multipath_flow4 *flow, void *ctx)
+static void ip_multipath_flow_skb(struct multipath_flow4 *flow,
+ enum rt_mp_alg_t algo, void *ctx)
{
const struct sk_buff *skb = (const struct sk_buff *)ctx;
- const struct iphdr *iph;
+ struct icmphdr _icmph;
+ struct iphdr _inner_iph;
+ const struct iphdr *outer_iph;
+ const struct icmphdr *icmph;
+ const struct iphdr *inner_iph;
+ unsigned int offset;
- iph = ip_hdr(skb);
+ outer_iph = ip_hdr(skb);
- flow->saddr = iph->saddr;
- flow->daddr = iph->daddr;
+ flow->saddr = outer_iph->saddr;
+ flow->daddr = outer_iph->daddr;
flow->ports = 0;
- if (unlikely(!(iph->frag_off & htons(IP_DF))))
- return;
+ offset = outer_iph->ihl * 4;
- if (iph->protocol == IPPROTO_TCP ||
- iph->protocol == IPPROTO_UDP ||
- iph->protocol == IPPROTO_SCTP) {
+ if (algo == RT_MP_ALG_L4_HASH) {
__be16 _ports[2];
const __be16 *ports;
- ports = skb_header_pointer(skb, iph->ihl * 4, sizeof(_ports),
+ if (unlikely(!(outer_iph->frag_off & htons(IP_DF))))
+ return;
+
+ if (outer_iph->protocol != IPPROTO_TCP &&
+ outer_iph->protocol != IPPROTO_UDP &&
+ outer_iph->protocol != IPPROTO_SCTP) {
+ return;
+ }
+
+ ports = skb_header_pointer(skb, offset, sizeof(_ports),
&_ports);
if (ports) {
flow->sport = ports[0];
flow->dport = ports[1];
}
+
+ return;
+ }
+
+ if (outer_iph->protocol != IPPROTO_ICMP)
+ return;
+
+ if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
+ return;
+
+ icmph = skb_header_pointer(skb, offset, sizeof(_icmph), &_icmph);
+ if (!icmph)
+ return;
+
+ if (icmph->type != ICMP_DEST_UNREACH &&
+ icmph->type != ICMP_SOURCE_QUENCH &&
+ icmph->type != ICMP_REDIRECT &&
+ icmph->type != ICMP_TIME_EXCEEDED &&
+ icmph->type != ICMP_PARAMETERPROB) {
+ return;
}
+
+ offset += sizeof(_icmph);
+ inner_iph = skb_header_pointer(skb, offset, sizeof(_inner_iph),
+ &_inner_iph);
+ if (!inner_iph)
+ return;
+
+ /* Since the ICMP payload contains a packet sent from the current
+ * recipient, we swap source and destination addresses
+ */
+ flow->saddr = inner_iph->daddr;
+ flow->daddr = inner_iph->saddr;
}
/* Fill multipath flow key data based on flowi4 */
-static void ip_multipath_flow_fl4(struct multipath_flow4 *flow, void *ctx)
+static void ip_multipath_flow_fl4(struct multipath_flow4 *flow,
+ enum rt_mp_alg_t algo, void *ctx)
{
const struct flowi4 *fl4 = (const struct flowi4 *)ctx;
@@ -2086,7 +2131,9 @@ add:
* Major route resolver routine.
*/
-struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
+struct rtable *__ip_route_output_key_flow(struct net *net, struct flowi4 *fl4,
+ multipath_flow4_func_t flow_func,
+ void *ctx)
{
struct net_device *dev_out = NULL;
__u8 tos = RT_FL_TOS(fl4);
@@ -2248,9 +2295,12 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
}
#ifdef CONFIG_IP_ROUTE_MULTIPATH
- if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
- fib_select_multipath(&res, ip_multipath_flow_fl4, fl4);
- else
+ if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0) {
+ if (flow_func)
+ fib_select_multipath(&res, flow_func, ctx);
+ else
+ fib_select_multipath(&res, ip_multipath_flow_fl4, fl4);
+ } else
#endif
if (!res.prefixlen &&
res.table->tb_num_default > 1 &&
--
2.1.4
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
[not found] ` <1440792050-2109-1-git-send-email-pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
2015-08-28 20:00 ` [PATCH v2 net-next 1/3] ipv4: Lock-less per-packet multipath pch-chEQUL3jiZBWk0Htik3J/w
2015-08-28 20:00 ` [PATCH v2 net-next 2/3] ipv4: L3 and L4 hash-based multipath routing pch-chEQUL3jiZBWk0Htik3J/w
@ 2015-08-29 20:14 ` David Miller
[not found] ` <20150829.131429.360433621593751136.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2015-08-29 20:14 UTC (permalink / raw)
To: pch-chEQUL3jiZBWk0Htik3J/w
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
jmorris-gx6/JNMH7DfYtjvyW6yDsg, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
kaber-dcUjhNyLwpNeoWH0uzbU5w, linux-api-u79uwXL29TY76Z2rM5mHXA,
roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
sfeldma-Re5JQEeQqe8AvxtiuMwx3w, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w, tgraf-G/eBtMaohhA,
jbenc-H+wXaHxf7aLQT0dZR+AlfA
From: pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org
Date: Fri, 28 Aug 2015 22:00:47 +0200
> When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
> from more or less being destination-based into being quasi-random per-packet
> scheduling. This increases the risk of out-of-order packets and makes it
> impossible to use multipath together with anycast services.
Don't even try to be fancy.
Simply kill the round-robin stuff off completely, and make hash based
routing the one and only mode, no special configuration stuff
necessary.
This puts ipv4 in line with ipv6, and nobody cares about the existing
round robin behavior at all.
Thanks.
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
[not found] ` <20150829.131429.360433621593751136.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
@ 2015-08-29 20:31 ` Peter Nørlund
2015-08-29 20:46 ` David Miller
0 siblings, 1 reply; 13+ messages in thread
From: Peter Nørlund @ 2015-08-29 20:31 UTC (permalink / raw)
To: David Miller
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
jmorris-gx6/JNMH7DfYtjvyW6yDsg, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
kaber-dcUjhNyLwpNeoWH0uzbU5w, linux-api-u79uwXL29TY76Z2rM5mHXA,
roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
sfeldma-Re5JQEeQqe8AvxtiuMwx3w, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w, tgraf-G/eBtMaohhA,
jbenc-H+wXaHxf7aLQT0dZR+AlfA
On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org
> Date: Fri, 28 Aug 2015 22:00:47 +0200
>
> > When the routing cache was removed in 3.6, the IPv4 multipath
> > algorithm changed from more or less being destination-based into
> > being quasi-random per-packet scheduling. This increases the risk
> > of out-of-order packets and makes it impossible to use multipath
> > together with anycast services.
>
> Don't even try to be fancy.
>
> Simply kill the round-robin stuff off completely, and make hash based
> routing the one and only mode, no special configuration stuff
> necessary.
I like the sound of that! Just to be clear - are you telling me to
stick with L3 and skip the L4 part?
>
> This puts ipv4 in line with ipv6, and nobody cares about the existing
> round robin behavior at all.
>
> Thanks.
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
2015-08-29 20:31 ` Peter Nørlund
@ 2015-08-29 20:46 ` David Miller
[not found] ` <20150829.134628.1013990034021542524.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2015-08-29 20:46 UTC (permalink / raw)
To: pch-chEQUL3jiZBWk0Htik3J/w
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
jmorris-gx6/JNMH7DfYtjvyW6yDsg, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
kaber-dcUjhNyLwpNeoWH0uzbU5w, linux-api-u79uwXL29TY76Z2rM5mHXA,
roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
sfeldma-Re5JQEeQqe8AvxtiuMwx3w, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w, tgraf-G/eBtMaohhA,
jbenc-H+wXaHxf7aLQT0dZR+AlfA
From: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
Date: Sat, 29 Aug 2015 22:31:15 +0200
> On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
> David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
>
>> From: pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org
>> Date: Fri, 28 Aug 2015 22:00:47 +0200
>>
>> > When the routing cache was removed in 3.6, the IPv4 multipath
>> > algorithm changed from more or less being destination-based into
>> > being quasi-random per-packet scheduling. This increases the risk
>> > of out-of-order packets and makes it impossible to use multipath
>> > together with anycast services.
>>
>> Don't even try to be fancy.
>>
>> Simply kill the round-robin stuff off completely, and make hash based
>> routing the one and only mode, no special configuration stuff
>> necessary.
>
> I like the sound of that! Just to be clear - are you telling me to
> stick with L3 and skip the L4 part?
For now it seems best to just do L3 and make ipv4 and ipv6 behave the
same.
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
[not found] ` <20150829.134628.1013990034021542524.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
@ 2015-08-29 20:55 ` Scott Feldman
2015-08-29 20:59 ` Tom Herbert
1 sibling, 0 replies; 13+ messages in thread
From: Scott Feldman @ 2015-08-29 20:55 UTC (permalink / raw)
To: David Miller
Cc: pch-chEQUL3jiZBWk0Htik3J/w, Netdev, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
jmorris-gx6/JNMH7DfYtjvyW6yDsg, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
kaber-dcUjhNyLwpNeoWH0uzbU5w, linux-api-u79uwXL29TY76Z2rM5mHXA,
Roopa Prabhu, ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org,
Nicolas Dichtel, Thomas Graf, jbenc-H+wXaHxf7aLQT0dZR+AlfA
On Sat, Aug 29, 2015 at 1:46 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
> Date: Sat, 29 Aug 2015 22:31:15 +0200
>
>> On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
>> David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
>>
>>> From: pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org
>>> Date: Fri, 28 Aug 2015 22:00:47 +0200
>>>
>>> > When the routing cache was removed in 3.6, the IPv4 multipath
>>> > algorithm changed from more or less being destination-based into
>>> > being quasi-random per-packet scheduling. This increases the risk
>>> > of out-of-order packets and makes it impossible to use multipath
>>> > together with anycast services.
>>>
>>> Don't even try to be fancy.
>>>
>>> Simply kill the round-robin stuff off completely, and make hash based
>>> routing the one and only mode, no special configuration stuff
>>> necessary.
>>
>> I like the sound of that! Just to be clear - are you telling me to
>> stick with L3 and skip the L4 part?
>
> For now it seems best to just do L3 and make ipv4 and ipv6 behave the
> same.
That makes mapping the hash to offload hardware easier also.
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
[not found] ` <20150829.134628.1013990034021542524.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2015-08-29 20:55 ` Scott Feldman
@ 2015-08-29 20:59 ` Tom Herbert
2015-08-30 21:28 ` Peter Nørlund
1 sibling, 1 reply; 13+ messages in thread
From: Tom Herbert @ 2015-08-29 20:59 UTC (permalink / raw)
To: David Miller
Cc: Peter Christensen, Linux Kernel Network Developers,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, linux-api-u79uwXL29TY76Z2rM5mHXA, Roopa Prabhu,
sfeldma-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman,
Nicolas Dichtel, Thomas Graf, Jiri Benc
On Sat, Aug 29, 2015 at 1:46 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
> Date: Sat, 29 Aug 2015 22:31:15 +0200
>
>> On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
>> David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
>>
>>> From: pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org
>>> Date: Fri, 28 Aug 2015 22:00:47 +0200
>>>
>>> > When the routing cache was removed in 3.6, the IPv4 multipath
>>> > algorithm changed from more or less being destination-based into
>>> > being quasi-random per-packet scheduling. This increases the risk
>>> > of out-of-order packets and makes it impossible to use multipath
>>> > together with anycast services.
>>>
>>> Don't even try to be fancy.
>>>
>>> Simply kill the round-robin stuff off completely, and make hash based
>>> routing the one and only mode, no special configuration stuff
>>> necessary.
>>
>> I like the sound of that! Just to be clear - are you telling me to
>> stick with L3 and skip the L4 part?
>
> For now it seems best to just do L3 and make ipv4 and ipv6 behave the
> same.
This might be simpler if we just go directly to L4 which should be
better load balancing and what most switches are doing anyway. The
hash comes from:
1) If a lookup includes an skb, we just need to call skb_get_hash.
2) If we have a socket and sk->sk_txhash is nonzero then use that.
3) Else compute a hash from flowi. We don't have the exact functions
for this, but they can be easily derived from __skb_get_hash_flowi4
and __skb_get_hash_flowi6 (i.e. create general get_hash_flowi4 and
get_hash_flowi6 and then call these from skb functions and multipath
lookup).
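[For readers following along, the priority order described above could be sketched in plain userspace C roughly as follows; the struct fields and the helper here are illustrative stand-ins, not the actual kernel API:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel objects involved; only the
 * fields needed to show the priority order are modelled here. */
struct sk_buff { uint32_t hash; };        /* what skb_get_hash() returns */
struct sock   { uint32_t sk_txhash; };    /* per-socket transmit hash */
struct flowi4 { uint32_t saddr, daddr; }; /* flow key used in route lookups */

/* Placeholder for a real get_hash_flowi4(); the kernel version would
 * run a keyed jhash over the flowi4 fields. */
static uint32_t get_hash_flowi4(const struct flowi4 *fl4)
{
	return fl4->saddr ^ fl4->daddr ^ 0x9e3779b9u;
}

/* 1) skb hash if the lookup includes an skb, 2) sk->sk_txhash if
 * nonzero, 3) otherwise compute a hash from the flowi4. */
static uint32_t multipath_hash(const struct sk_buff *skb,
			       const struct sock *sk,
			       const struct flowi4 *fl4)
{
	if (skb)
		return skb->hash;	/* skb_get_hash(skb) in the kernel */
	if (sk && sk->sk_txhash)
		return sk->sk_txhash;
	return get_hash_flowi4(fl4);
}
```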
Tom
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
2015-08-29 20:59 ` Tom Herbert
@ 2015-08-30 21:28 ` Peter Nørlund
2015-08-30 22:29 ` Tom Herbert
0 siblings, 1 reply; 13+ messages in thread
From: Peter Nørlund @ 2015-08-30 21:28 UTC (permalink / raw)
To: Tom Herbert
Cc: David Miller, Linux Kernel Network Developers, Alexey Kuznetsov,
James Morris, Hideaki YOSHIFUJI, Patrick McHardy, linux-api,
Roopa Prabhu, sfeldma, Eric W. Biederman, Nicolas Dichtel,
Thomas Graf, Jiri Benc
On Sat, 29 Aug 2015 13:59:08 -0700
Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Aug 29, 2015 at 1:46 PM, David Miller <davem@davemloft.net>
> wrote:
> > From: Peter Nørlund <pch@ordbogen.com>
> > Date: Sat, 29 Aug 2015 22:31:15 +0200
> >
> >> On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
> >> David Miller <davem@davemloft.net> wrote:
> >>
> >>> From: pch@ordbogen.com
> >>> Date: Fri, 28 Aug 2015 22:00:47 +0200
> >>>
> >>> > When the routing cache was removed in 3.6, the IPv4 multipath
> >>> > algorithm changed from more or less being destination-based into
> >>> > being quasi-random per-packet scheduling. This increases the
> >>> > risk of out-of-order packets and makes it impossible to use
> >>> > multipath together with anycast services.
> >>>
> >>> Don't even try to be fancy.
> >>>
> >>> Simply kill the round-robin stuff off completely, and make hash
> >>> based routing the one and only mode, no special configuration
> >>> stuff necessary.
> >>
> >> I like the sound of that! Just to be clear - are you telling me to
> >> stick with L3 and skip the L4 part?
> >
> > For now it seems best to just do L3 and make ipv4 and ipv6 behave
> > the same.
>
> This might be simpler if we just go directly to L4 which should be
> better load balancing and what most switches are doing anyway. The
> hash comes from:
>
> 1) If a lookup includes an skb, we just need to call skb_get_hash.
> 2) If we have a socket and sk->sk_txhash is nonzero then use that.
> 3) Else compute a hash from flowi. We don't have the exact functions
> for this, but they can be easily derived from __skb_get_hash_flowi4
> and __skb_get_hash_flowi6 (i.e. create general get_hash_flowi4 and
> get_hash_flowi6 and then call these from skb functions and multipath
> lookup).
It would definitely be simpler, and it would be nice to just fetch the
hash directly from the NIC - and for link aggregation it would probably
be fine. But with L4, we always need to consider fragmented packets,
which might cause some packets of a flow to be routed differently - and
with ECMP, the ramifications of suddenly choosing another path for a
flow are worse than for link aggregation. The latency through the
different paths may differ enough to cause out-of-order packets and bad
TCP performance as a consequence. Both Cisco and Juniper routers
default to L3 for ECMP - exactly for that reason, I believe. RFC 2991
also points out that ports probably shouldn't be used as part of the
flow key with ECMP.
With anycast it is even worse. Depending on how anycast is used,
changing path may destroy a TCP connection. And without special
treatment of ICMP, ICMP packets may hit another anycast node, causing
PMTU to fail. Cloudflare recognized this and solved it by letting a
user space daemon (pmtud) route ICMP packets through all paths, ensuring
that the anycast node receives the ICMP. But a more efficient solution
is to handle the issue within the kernel.
It might be possible to do L4 on flows using PMTU, and while it
is possible to extract addresses and ports from the ICMP payload, you
can't rely on the DF-bit in the ICMP payload, since it comes from the
opposite flow (Flow A->B uses PMTU while B->A doesn't). I guess you
can technically reduce the number of possible paths to two, though.
I obviously prefer the default to be L3 with ICMP handling, since I
specifically plan to use ECMP together with anycast (albeit anycasted
load balancers which synchronize state, although delayed), but I also
recognize that anycast is a special case. The question is, is it so much
of a special case that it belongs outside the vanilla kernel?
Regards,
Peter Nørlund
>
> Tom
>
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
2015-08-30 21:28 ` Peter Nørlund
@ 2015-08-30 22:29 ` Tom Herbert
2015-08-31 9:02 ` Thomas Graf
0 siblings, 1 reply; 13+ messages in thread
From: Tom Herbert @ 2015-08-30 22:29 UTC (permalink / raw)
To: Peter Nørlund
Cc: David Miller, Linux Kernel Network Developers, Alexey Kuznetsov,
James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
linux-api-u79uwXL29TY76Z2rM5mHXA, Roopa Prabhu, sfeldma,
Eric W. Biederman, Nicolas Dichtel, Thomas Graf, Jiri Benc
On Sun, Aug 30, 2015 at 2:28 PM, Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org> wrote:
> On Sat, 29 Aug 2015 13:59:08 -0700
> Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
>
>> On Sat, Aug 29, 2015 at 1:46 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
>> wrote:
>> > From: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
>> > Date: Sat, 29 Aug 2015 22:31:15 +0200
>> >
>> >> On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
>> >> David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
>> >>
>> >>> From: pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org
>> >>> Date: Fri, 28 Aug 2015 22:00:47 +0200
>> >>>
>> >>> > When the routing cache was removed in 3.6, the IPv4 multipath
>> >>> > algorithm changed from more or less being destination-based into
>> >>> > being quasi-random per-packet scheduling. This increases the
>> >>> > risk of out-of-order packets and makes it impossible to use
>> >>> > multipath together with anycast services.
>> >>>
>> >>> Don't even try to be fancy.
>> >>>
>> >>> Simply kill the round-robin stuff off completely, and make hash
>> >>> based routing the one and only mode, no special configuration
>> >>> stuff necessary.
>> >>
>> >> I like the sound of that! Just to be clear - are you telling me to
>> >> stick with L3 and skip the L4 part?
>> >
>> > For now it seems best to just do L3 and make ipv4 and ipv6 behave
>> > the same.
>>
>> This might be simpler if we just go directly to L4 which should be
>> better load balancing and what most switches are doing anyway. The
>> hash comes from:
>>
>> 1) If a lookup includes an skb, we just need to call skb_get_hash.
>> 2) If we have a socket and sk->sk_txhash is nonzero then use that.
>> 3) Else compute a hash from flowi. We don't have the exact functions
>> for this, but they can be easily derived from __skb_get_hash_flowi4
>> and __skb_get_hash_flowi6 (i.e. create general get_hash_flowi4 and
>> get_hash_flowi6 and then call these from skb functions and multipath
>> lookup).
>
> It would definitely be simpler, and it would be nice to just fetch the
> hash directly from the NIC - and for link aggregation it would probably
> be fine. But with L4, we always need to consider fragmented packets,
> which might cause some packets of a flow to be routed differently - and
> with ECMP, the ramifications of suddenly choosing another path for a
> flow are worse than for link aggregation. The latency through the
> different paths may differ enough to cause out-of-order packets and bad
> TCP performance as a consequence. Both Cisco and Juniper routers
> default to L3 for ECMP - exactly for that reason, I believe. RFC 2991
> also points out that ports probably shouldn't be used as part of the
> flow key with ECMP.
>
That's more reason why we need vendors to use IPv6 flow label instead
of ports to do ECMP :-). In any case, if we're fragmenting TCP packets
then we're already in a bad place performance-wise-- we really don't
need to optimize for that case. Albeit, it would be nice if fragments
of a packet followed the same path, but that would require devices to
not do an L4 hash over ports when MF is set-- I don't know if anyone
does that (I have been meaning to add that to the stack).
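[The check being alluded to here (skip the L4 ports whenever a packet is any kind of fragment) is small; a userspace sketch, with the flag constants following the IPv4 header layout:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IP_MF     0x2000u /* "more fragments" flag */
#define IP_OFFSET 0x1fffu /* fragment offset mask */

/* L4 ports are only safe to hash when the packet is not fragmented at
 * all: a first fragment has MF set, and later fragments (offset > 0)
 * carry no L4 header whatsoever. */
static bool l4_hash_ok(uint16_t frag_off)	/* host byte order */
{
	return (frag_off & (IP_MF | IP_OFFSET)) == 0;
}
```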
> With anycast it is even worse. Depending on how anycast is used,
> changing path may destroy a TCP connection. And without special
> treatment of ICMP, ICMP packets may hit another anycast node, causing
> PMTU to fail. Cloudflare recognized this and solved it by letting a
> user space daemon (pmtud) route ICMP packets through all paths, ensuring
> that the anycast node receives the ICMP. But a more efficient solution
> is to handle the issue within the kernel.
>
> It might be possible to do L4 on flows using PMTU, and while it
> is possible to extract addresses and ports from the ICMP payload, you
> can't rely on the DF-bit in the ICMP payload, since it comes from the
> opposite flow (Flow A->B uses PMTU while B->A doesn't). I guess you
> can technically reduce the number of possible paths to two, though.
>
> I obviously prefer the default to be L3 with ICMP handling, since I
> specifically plan to use ECMP together with anycast (albeit anycasted
> load balancers which synchronize state, although delayed), but I also
> recognize that anycast is a special case. The question is, is it so much
> of a special case that it belongs outside the vanilla kernel?
>
OTOH, if the hash is always dependent on fixed fields of a connection
(L3 or L4) then the path can never change during the lifetime of a
connection, which is a bad thing if we want to try a different path
when a connection is failing (via ipv4_negative_advice). This is why
there is value in using sk->sk_txhash as a route selector.
It is stunning that anycast works at all given its dependency on the
network path being stable, but I suppose it is functionality we'll need
to preserve.
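[A minimal userspace sketch of that escape hatch; the function name mirrors the kernel's sk_rethink_txhash(), but the random source here is a stand-in:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sock { uint32_t sk_txhash; };

/* Picking a fresh random sk_txhash changes the route selector, so the
 * next lookup may land on a different next hop -- the escape hatch a
 * pure 5-tuple hash cannot offer. (The kernel uses a kernel PRNG;
 * rand() is a userspace stand-in.) */
static void sk_rethink_txhash(struct sock *sk)
{
	uint32_t h;

	do {
		h = (uint32_t)rand();
	} while (h == 0);	/* 0 means "no txhash set" */
	sk->sk_txhash = h;
}
```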
Tom
> Regards,
> Peter Nørlund
>
>>
>> Tom
>>
>
* Re: [PATCH v2 net-next 2/3] ipv4: L3 and L4 hash-based multipath routing
2015-08-28 20:00 ` [PATCH v2 net-next 2/3] ipv4: L3 and L4 hash-based multipath routing pch-chEQUL3jiZBWk0Htik3J/w
@ 2015-08-30 22:48 ` Tom Herbert
0 siblings, 0 replies; 13+ messages in thread
From: Tom Herbert @ 2015-08-30 22:48 UTC (permalink / raw)
To: Peter Christensen
Cc: Linux Kernel Network Developers, David S. Miller,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, linux-api, Roopa Prabhu, Scott Feldman,
Eric W. Biederman, Nicolas Dichtel, Thomas Graf, Jiri Benc
On Fri, Aug 28, 2015 at 1:00 PM, <pch@ordbogen.com> wrote:
> From: Peter Nørlund <pch@ordbogen.com>
>
> This patch adds L3 and L4 hash-based multipath routing, selectable on a
> per-route basis with the reintroduced RTA_MP_ALGO attribute. The default is
> now RT_MP_ALG_L3_HASH.
>
> Signed-off-by: Peter Nørlund <pch@ordbogen.com>
> ---
> include/net/ip_fib.h | 22 ++++++++++++++++-
> include/uapi/linux/rtnetlink.h | 14 ++++++++++-
> net/ipv4/fib_frontend.c | 4 +++
> net/ipv4/fib_semantics.c | 43 +++++++++++++++++++++++++++-----
> net/ipv4/route.c | 56 ++++++++++++++++++++++++++++++++++++++++--
> 5 files changed, 129 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 18a3c7f..21e74b5 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -37,6 +37,7 @@ struct fib_config {
> u32 fc_flags;
> u32 fc_priority;
> __be32 fc_prefsrc;
> + int fc_mp_alg;
> struct nlattr *fc_mx;
> struct rtnexthop *fc_mp;
> int fc_mx_len;
> @@ -119,6 +120,7 @@ struct fib_info {
> int fib_nhs;
> #ifdef CONFIG_IP_ROUTE_MULTIPATH
> int fib_weight;
> + int fib_mp_alg;
> #endif
> struct rcu_head rcu;
> struct fib_nh fib_nh[0];
> @@ -312,7 +314,25 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
> int fib_sync_down_dev(struct net_device *dev, unsigned long event);
> int fib_sync_down_addr(struct net *net, __be32 local);
> int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
> -void fib_select_multipath(struct fib_result *res);
> +
> +struct multipath_flow4 {
Please use flowi4 structure instead.
> + __be32 saddr;
> + __be32 daddr;
> + union {
> + __be32 ports;
> + struct {
> + __be16 sport;
> + __be16 dport;
> + };
> + };
> +};
> +
> +typedef void (*multipath_flow4_func_t)(struct multipath_flow4 *flow,
> + void *ctx);
> +
> +void fib_select_multipath(struct fib_result *res,
> + multipath_flow4_func_t flow_func,
> + void *ctx);
>
> /* Exported by fib_trie.c */
> void fib_trie_init(void);
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 0d3d3cc..2563a96 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -271,6 +271,18 @@ enum rt_scope_t {
> #define RTM_F_EQUALIZE 0x400 /* Multipath equalizer: NI */
> #define RTM_F_PREFIX 0x800 /* Prefix addresses */
>
> +/* Multipath algorithms */
> +
> +enum rt_mp_alg_t {
> + RT_MP_ALG_L3_HASH, /* Was IP_MP_ALG_NONE */
> + RT_MP_ALG_PER_PACKET, /* Was IP_MP_ALG_RR */
> + RT_MP_ALG_DRR, /* not used */
> + RT_MP_ALG_RANDOM, /* not used */
> + RT_MP_ALG_WRANDOM, /* not used */
> + RT_MP_ALG_L4_HASH,
> + __RT_MP_ALG_MAX
> +};
> +
> /* Reserved table identifiers */
>
> enum rt_class_t {
> @@ -301,7 +313,7 @@ enum rtattr_type_t {
> RTA_FLOW,
> RTA_CACHEINFO,
> RTA_SESSION, /* no longer used */
> - RTA_MP_ALGO, /* no longer used */
> + RTA_MP_ALGO,
> RTA_TABLE,
> RTA_MARK,
> RTA_MFC_STATS,
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 7fa2771..5ba4442 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -622,6 +622,7 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = {
> [RTA_PREFSRC] = { .type = NLA_U32 },
> [RTA_METRICS] = { .type = NLA_NESTED },
> [RTA_MULTIPATH] = { .len = sizeof(struct rtnexthop) },
> + [RTA_MP_ALGO] = { .type = NLA_U32 },
> [RTA_FLOW] = { .type = NLA_U32 },
> [RTA_ENCAP_TYPE] = { .type = NLA_U16 },
> [RTA_ENCAP] = { .type = NLA_NESTED },
> @@ -684,6 +685,9 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
> cfg->fc_mp = nla_data(attr);
> cfg->fc_mp_len = nla_len(attr);
> break;
> + case RTA_MP_ALGO:
> + cfg->fc_mp_alg = nla_get_u32(attr);
> + break;
> case RTA_FLOW:
> cfg->fc_flow = nla_get_u32(attr);
> break;
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index becb63f..3a80b1a 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -259,6 +259,11 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
> {
> const struct fib_nh *onh = ofi->fib_nh;
>
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> + if (fi->fib_mp_alg != ofi->fib_mp_alg)
> + return -1;
> +#endif
> +
> for_nexthops(fi) {
> if (nh->nh_oif != onh->nh_oif ||
> nh->nh_gw != onh->nh_gw ||
> @@ -1028,6 +1033,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>
> if (cfg->fc_mp) {
> #ifdef CONFIG_IP_ROUTE_MULTIPATH
> + fi->fib_mp_alg = cfg->fc_mp_alg;
> err = fib_get_nhs(fi, cfg->fc_mp, cfg->fc_mp_len, cfg);
> if (err != 0)
> goto failure;
> @@ -1245,6 +1251,10 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
> struct rtnexthop *rtnh;
> struct nlattr *mp;
>
> + if (fi->fib_mp_alg &&
> + nla_put_u32(skb, RTA_MP_ALGO, fi->fib_mp_alg))
> + goto nla_put_failure;
> +
> mp = nla_nest_start(skb, RTA_MULTIPATH);
> if (!mp)
> goto nla_put_failure;
> @@ -1520,16 +1530,37 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>
> #ifdef CONFIG_IP_ROUTE_MULTIPATH
>
> -/*
> - * The algorithm is suboptimal, but it provides really
> - * fair weighted route distribution.
> - */
> -void fib_select_multipath(struct fib_result *res)
> +/* Compute multipath hash based on 3- or 5-tuple */
> +static int fib_multipath_hash(const struct fib_result *res,
> + multipath_flow4_func_t flow_func, void *ctx)
> +{
> + struct multipath_flow4 flow;
> +
> + flow_func(&flow, ctx);
> +
> + if (res->fi->fib_mp_alg == RT_MP_ALG_L4_HASH)
> + return jhash_3words(flow.saddr, flow.daddr, flow.ports, 0) >> 1;
> + else
> + return jhash_2words(flow.saddr, flow.daddr, 0) >> 1;
Please define/use general functions. As I pointed out, an L4 hash may
already be available via skb_get_hash or sk->sk_txhash. For L3 we could
add a function that uses the flow dissector to only get addresses and
then hash over that. Also, jhash really should include a randomized key
to avoid any concerns about DoS.
> +}
> +
> +static int fib_multipath_perpacket(void)
> +{
> + return bitrev32(this_cpu_inc_return(fib_multipath_rr_counter)) >> 1;
> +}
> +
> +void fib_select_multipath(struct fib_result *res,
> + multipath_flow4_func_t flow_func,
> + void *ctx)
> {
> struct fib_info *fi = res->fi;
> int h;
>
> - h = bitrev32(this_cpu_inc_return(fib_multipath_rr_counter)) >> 1;
> + if (res->fi->fib_mp_alg == RT_MP_ALG_PER_PACKET) {
> + h = fib_multipath_perpacket();
> + } else {
> + h = fib_multipath_hash(res, flow_func, ctx);
> + }
>
> for_nexthops(fi) {
> if (h > atomic_read(&nh->nh_upper_bound))
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f3087aa..f50f84f 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1643,6 +1643,58 @@ out:
> return err;
> }
>
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +
> +/* Fill multipath flow key data based on socket buffer */
> +static void ip_multipath_flow_skb(struct multipath_flow4 *flow, void *ctx)
Please use flow_dissector for this, don't define a new dissection method.
> +{
> + const struct sk_buff *skb = (const struct sk_buff *)ctx;
> + const struct iphdr *iph;
> +
> + iph = ip_hdr(skb);
> +
> + flow->saddr = iph->saddr;
> + flow->daddr = iph->daddr;
> + flow->ports = 0;
> +
> + if (unlikely(!(iph->frag_off & htons(IP_DF))))
> + return;
> +
> + if (iph->protocol == IPPROTO_TCP ||
> + iph->protocol == IPPROTO_UDP ||
> + iph->protocol == IPPROTO_SCTP) {
> + __be16 _ports[2];
> + const __be16 *ports;
> +
> + ports = skb_header_pointer(skb, iph->ihl * 4, sizeof(_ports),
> + &_ports);
> + if (ports) {
> + flow->sport = ports[0];
> + flow->dport = ports[1];
> + }
> + }
> +}
> +
> +/* Fill multipath flow key data based on flowi4 */
> +static void ip_multipath_flow_fl4(struct multipath_flow4 *flow, void *ctx)
> +{
> + const struct flowi4 *fl4 = (const struct flowi4 *)ctx;
> +
> + flow->saddr = fl4->saddr;
> + flow->daddr = fl4->daddr;
> +
> + if (fl4->flowi4_proto == IPPROTO_TCP ||
> + fl4->flowi4_proto == IPPROTO_UDP ||
> + fl4->flowi4_proto == IPPROTO_SCTP) {
> + flow->sport = fl4->fl4_sport;
> + flow->dport = fl4->fl4_dport;
> + } else {
> + flow->ports = 0;
> + }
> +}
> +
> +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
> +
> static int ip_mkroute_input(struct sk_buff *skb,
> struct fib_result *res,
> const struct flowi4 *fl4,
> @@ -1651,7 +1703,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
> {
> #ifdef CONFIG_IP_ROUTE_MULTIPATH
> if (res->fi && res->fi->fib_nhs > 1)
> - fib_select_multipath(res);
> + fib_select_multipath(res, ip_multipath_flow_skb, skb);
> #endif
>
> /* create a routing cache entry */
> @@ -2197,7 +2249,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
>
> #ifdef CONFIG_IP_ROUTE_MULTIPATH
> if (res.fi->fib_nhs > 1 && fl4->flowi4_oif == 0)
> - fib_select_multipath(&res);
> + fib_select_multipath(&res, ip_multipath_flow_fl4, fl4);
> else
> #endif
> if (!res.prefixlen &&
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing
2015-08-30 22:29 ` Tom Herbert
@ 2015-08-31 9:02 ` Thomas Graf
0 siblings, 0 replies; 13+ messages in thread
From: Thomas Graf @ 2015-08-31 9:02 UTC (permalink / raw)
To: Tom Herbert
Cc: Peter Nørlund, David Miller, Linux Kernel Network Developers,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, linux-api, Roopa Prabhu, sfeldma,
Eric W. Biederman, Nicolas Dichtel, Jiri Benc
On 08/30/15 at 03:29pm, Tom Herbert wrote:
> On Sun, Aug 30, 2015 at 2:28 PM, Peter Nørlund <pch@ordbogen.com> wrote:
> > It would definitely be simpler, and it would be nice to just fetch the
> > hash directly from the NIC - and for link aggregation it would probably
> > be fine. But with L4, we always need to consider fragmented packets,
> > which might cause some packets of a flow to be routed differently - and
> > with ECMP, the ramifications of suddenly choosing another path for a
> > flow are worse than for link aggregation. The latency through the
> > different paths may differ enough to cause out-of-order packets and bad
> > TCP performance as a consequence. Both Cisco and Juniper routers
> > default to L3 for ECMP - exactly for that reason, I believe. RFC 2991
> > also points out that ports probably shouldn't be used as part of the
> > flow key with ECMP.
> >
> That's more reason why we need vendors to use IPv6 flow label instead
> of ports to do ECMP :-). In any case, if we're fragmenting TCP packets
> then we're already in a bad place performance-wise-- we really don't
> need to optimize for that case. Albeit, it would be nice if fragments
> of a packet followed the same path, but that would require devices to
> not do an L4 hash over ports when MF is set-- I don't know if anyone
> does that (I have been meaning to add that to the stack).
+1 for solving this at the hash level. Being able to rely on the L4 HW
hash for multipath routing is very desirable. A simple MF bit ||
FO > 0 check, with a fallback to the flow dissector to generate an L3
hash in case the HW provided an L4 hash, should be sufficient to
address the fragmentation concern.
Since performance is gone anyway, I'm not sure it's worth offloading
this behaviour to the HW.
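[Put together, the selection described above would look roughly like this; a userspace sketch with invented parameter names:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IP_MF     0x2000u /* "more fragments" flag */
#define IP_OFFSET 0x1fffu /* fragment offset mask */

/* Prefer the L4 hash the NIC already computed, but fall back to an
 * L3-only hash (e.g. from the flow dissector) for any fragment, so all
 * fragments of a datagram keep taking the same path. */
static uint32_t ecmp_hash(bool hw_l4_hash_valid, uint32_t hw_l4_hash,
			  uint16_t frag_off, uint32_t l3_hash)
{
	bool fragment = (frag_off & (IP_MF | IP_OFFSET)) != 0;

	if (hw_l4_hash_valid && !fragment)
		return hw_l4_hash;
	return l3_hash;
}
```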