netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -next 0/3] tcp: receive-side per route dctcp handling
@ 2015-08-28 12:11 Florian Westphal
  2015-08-28 12:11 ` [PATCH -next 1/3] net: fib: move metrics parsing to a helper Florian Westphal
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Florian Westphal @ 2015-08-28 12:11 UTC (permalink / raw)
  To: netdev

Currently, the following case doesn't use DCTCP, even if it should:

- responder has f.e. cubic as system wide default
- 'ip route congctl dctcp $src' was set

Then, DCTCP is NOT used if a DCTCP sender attempts to connect from a
host in the $src range: ECT(0) is set, but listen_sk is not dctcp, so we
fail the INET_ECN_is_not_ect sanity check.

We also have to examine the dst used for the SYN/ACK reply to make this
case work.

In order to minimize additional cost, store the 'ecn is must have'
information is the dst_features field.

The set targets -next instead of -net since this doesn't seem to be a
serious bug and to give the change more soak time until it hits linus tree.

Daniel Borkmann (2):
      net: fib6: reduce identation in ip6_convert_metrics
      tcp: use dctcp if enabled on the route to the initiator

Florian Westphal (1):
      net: fib: move metrics parsing to a helper

 include/net/tcp.h              |  2 +-
 include/uapi/linux/rtnetlink.h | 11 +++---
 net/core/rtnetlink.c           |  4 +++
 net/ipv4/fib_semantics.c       | 78 ++++++++++++++++++++++++++----------------
 net/ipv4/tcp_cong.c            |  9 +++--
 net/ipv4/tcp_input.c           |  7 ++--
 net/ipv6/route.c               | 39 ++++++++++++---------
 7 files changed, 94 insertions(+), 56 deletions(-)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH -next 1/3] net: fib: move metrics parsing to a helper
  2015-08-28 12:11 [PATCH -next 0/3] tcp: receive-side per route dctcp handling Florian Westphal
@ 2015-08-28 12:11 ` Florian Westphal
  2015-08-28 12:11 ` [PATCH -next 2/3] net: fib6: reduce identation in ip6_convert_metrics Florian Westphal
  2015-08-28 12:11 ` [PATCH -next 3/3] tcp: use dctcp if enabled on the route to the initiator Florian Westphal
  2 siblings, 0 replies; 6+ messages in thread
From: Florian Westphal @ 2015-08-28 12:11 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

fib_create_info() is already quite large, so before adding more
code to the metrics section move that to a helper, similar to
ip6_convert_metrics.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/fib_semantics.c | 71 ++++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 30 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 1b2d011..88afbae 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -876,6 +876,44 @@ static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
 	return true;
 }
 
+static int
+fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
+{
+	struct nlattr *nla;
+	int remaining;
+
+	if (!cfg->fc_mx)
+		return 0;
+
+	nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) {
+		int type = nla_type(nla);
+		u32 val;
+
+		if (!type)
+			continue;
+		if (type > RTAX_MAX)
+			return -EINVAL;
+
+		if (type == RTAX_CC_ALGO) {
+			char tmp[TCP_CA_NAME_MAX];
+
+			nla_strlcpy(tmp, nla, sizeof(tmp));
+			val = tcp_ca_get_key_by_name(tmp);
+			if (val == TCP_CA_UNSPEC)
+				return -EINVAL;
+		} else {
+			val = nla_get_u32(nla);
+		}
+		if (type == RTAX_ADVMSS && val > 65535 - 40)
+			val = 65535 - 40;
+		if (type == RTAX_MTU && val > 65535 - 15)
+			val = 65535 - 15;
+		fi->fib_metrics[type - 1] = val;
+	}
+
+	return 0;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
 	int err;
@@ -948,36 +986,9 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			goto failure;
 	} endfor_nexthops(fi)
 
-	if (cfg->fc_mx) {
-		struct nlattr *nla;
-		int remaining;
-
-		nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) {
-			int type = nla_type(nla);
-
-			if (type) {
-				u32 val;
-
-				if (type > RTAX_MAX)
-					goto err_inval;
-				if (type == RTAX_CC_ALGO) {
-					char tmp[TCP_CA_NAME_MAX];
-
-					nla_strlcpy(tmp, nla, sizeof(tmp));
-					val = tcp_ca_get_key_by_name(tmp);
-					if (val == TCP_CA_UNSPEC)
-						goto err_inval;
-				} else {
-					val = nla_get_u32(nla);
-				}
-				if (type == RTAX_ADVMSS && val > 65535 - 40)
-					val = 65535 - 40;
-				if (type == RTAX_MTU && val > 65535 - 15)
-					val = 65535 - 15;
-				fi->fib_metrics[type - 1] = val;
-			}
-		}
-	}
+	err = fib_convert_metrics(fi, cfg);
+	if (err)
+		goto failure;
 
 	if (cfg->fc_mp) {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH -next 2/3] net: fib6: reduce identation in ip6_convert_metrics
  2015-08-28 12:11 [PATCH -next 0/3] tcp: receive-side per route dctcp handling Florian Westphal
  2015-08-28 12:11 ` [PATCH -next 1/3] net: fib: move metrics parsing to a helper Florian Westphal
@ 2015-08-28 12:11 ` Florian Westphal
  2015-08-28 12:11 ` [PATCH -next 3/3] tcp: use dctcp if enabled on the route to the initiator Florian Westphal
  2 siblings, 0 replies; 6+ messages in thread
From: Florian Westphal @ 2015-08-28 12:11 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann

From: Daniel Borkmann <daniel@iogearbox.net>

Reduce the identation a bit, there's no need to artificically have
it increased.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 net/ipv6/route.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index df3e353..56b2e71 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1711,26 +1711,26 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
 
 	nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) {
 		int type = nla_type(nla);
+		u32 val;
 
-		if (type) {
-			u32 val;
-
-			if (unlikely(type > RTAX_MAX))
-				goto err;
-			if (type == RTAX_CC_ALGO) {
-				char tmp[TCP_CA_NAME_MAX];
+		if (!type)
+			continue;
+		if (unlikely(type > RTAX_MAX))
+			goto err;
 
-				nla_strlcpy(tmp, nla, sizeof(tmp));
-				val = tcp_ca_get_key_by_name(tmp);
-				if (val == TCP_CA_UNSPEC)
-					goto err;
-			} else {
-				val = nla_get_u32(nla);
-			}
+		if (type == RTAX_CC_ALGO) {
+			char tmp[TCP_CA_NAME_MAX];
 
-			mp[type - 1] = val;
-			__set_bit(type - 1, mxc->mx_valid);
+			nla_strlcpy(tmp, nla, sizeof(tmp));
+			val = tcp_ca_get_key_by_name(tmp);
+			if (val == TCP_CA_UNSPEC)
+				goto err;
+		} else {
+			val = nla_get_u32(nla);
 		}
+
+		mp[type - 1] = val;
+		__set_bit(type - 1, mxc->mx_valid);
 	}
 
 	mxc->mx = mp;
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH -next 3/3] tcp: use dctcp if enabled on the route to the initiator
  2015-08-28 12:11 [PATCH -next 0/3] tcp: receive-side per route dctcp handling Florian Westphal
  2015-08-28 12:11 ` [PATCH -next 1/3] net: fib: move metrics parsing to a helper Florian Westphal
  2015-08-28 12:11 ` [PATCH -next 2/3] net: fib6: reduce identation in ip6_convert_metrics Florian Westphal
@ 2015-08-28 12:11 ` Florian Westphal
  2015-08-28 21:18   ` David Miller
  2 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2015-08-28 12:11 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, Florian Westphal

From: Daniel Borkmann <daniel@iogearbox.net>

Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.

We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains RTAX_FEATURE_ECN_CA. This allows to only do a single metric
feature lookup inside tcp_ecn_create_request().

Joint work with Florian Westphal.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/tcp.h              |  2 +-
 include/uapi/linux/rtnetlink.h | 11 +++++++----
 net/core/rtnetlink.c           |  4 ++++
 net/ipv4/fib_semantics.c       |  9 ++++++++-
 net/ipv4/tcp_cong.c            |  9 ++++++---
 net/ipv4/tcp_input.c           |  7 +++++--
 net/ipv6/route.c               | 11 +++++++++--
 7 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4a7b039..0cab28c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -888,7 +888,7 @@ void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
 extern struct tcp_congestion_ops tcp_reno;
 
 struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
-u32 tcp_ca_get_key_by_name(const char *name);
+u32 tcp_ca_get_key_by_name(const char *name, bool *ecn_ca);
 #ifdef CONFIG_INET
 char *tcp_ca_get_name_by_key(u32 key, char *buffer);
 #else
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 0d3d3cc..a5eb242 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -418,10 +418,13 @@ enum {
 
 #define RTAX_MAX (__RTAX_MAX - 1)
 
-#define RTAX_FEATURE_ECN	0x00000001
-#define RTAX_FEATURE_SACK	0x00000002
-#define RTAX_FEATURE_TIMESTAMP	0x00000004
-#define RTAX_FEATURE_ALLFRAG	0x00000008
+#define RTAX_FEATURE_ECN	(1 << 0)
+#define RTAX_FEATURE_SACK	(1 << 1)
+#define RTAX_FEATURE_TIMESTAMP	(1 << 2)
+#define RTAX_FEATURE_ALLFRAG	(1 << 3)
+#define RTAX_FEATURE_ECN_CA	(1 << 4)
+
+#define RTAX_FEATURE_MASK_ECN	(RTAX_FEATURE_ECN | RTAX_FEATURE_ECN_CA)
 
 struct rta_session {
 	__u8	proto;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 788ceed..12a9b5a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -678,6 +678,10 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)
 					continue;
 				if (nla_put_string(skb, i + 1, name))
 					goto nla_put_failure;
+			} else if (i == RTAX_FEATURES - 1) {
+				u32 feat = metrics[i] & ~RTAX_FEATURE_ECN_CA;
+				if (nla_put_u32(skb, i + 1, feat))
+					goto nla_put_failure;
 			} else {
 				if (nla_put_u32(skb, i + 1, metrics[i]))
 					goto nla_put_failure;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 88afbae..d7ecc92 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -879,6 +879,7 @@ static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
 static int
 fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
 {
+	bool ecn_ca = false;
 	struct nlattr *nla;
 	int remaining;
 
@@ -898,7 +899,7 @@ fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
 			char tmp[TCP_CA_NAME_MAX];
 
 			nla_strlcpy(tmp, nla, sizeof(tmp));
-			val = tcp_ca_get_key_by_name(tmp);
+			val = tcp_ca_get_key_by_name(tmp, &ecn_ca);
 			if (val == TCP_CA_UNSPEC)
 				return -EINVAL;
 		} else {
@@ -908,9 +909,15 @@ fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
 			val = 65535 - 40;
 		if (type == RTAX_MTU && val > 65535 - 15)
 			val = 65535 - 15;
+		if (type == RTAX_FEATURES)
+			val &= ~RTAX_FEATURE_ECN_CA;
+
 		fi->fib_metrics[type - 1] = val;
 	}
 
+	if (ecn_ca)
+		fi->fib_metrics[RTAX_FEATURES - 1] |= RTAX_FEATURE_ECN_CA;
+
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index a2ed23c..93c4dc3 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -114,16 +114,19 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
 }
 EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
 
-u32 tcp_ca_get_key_by_name(const char *name)
+u32 tcp_ca_get_key_by_name(const char *name, bool *ecn_ca)
 {
 	const struct tcp_congestion_ops *ca;
-	u32 key;
+	u32 key = TCP_CA_UNSPEC;
 
 	might_sleep();
 
 	rcu_read_lock();
 	ca = __tcp_ca_find_autoload(name);
-	key = ca ? ca->key : TCP_CA_UNSPEC;
+	if (ca) {
+		key = ca->key;
+		*ecn_ca = ca->flags & TCP_CONG_NEEDS_ECN;
+	}
 	rcu_read_unlock();
 
 	return key;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dc08e23..39f0375 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6003,14 +6003,17 @@ static void tcp_ecn_create_request(struct request_sock *req,
 	const struct net *net = sock_net(listen_sk);
 	bool th_ecn = th->ece && th->cwr;
 	bool ect, ecn_ok;
+	u32 ecn_ok_dst;
 
 	if (!th_ecn)
 		return;
 
 	ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield);
-	ecn_ok = net->ipv4.sysctl_tcp_ecn || dst_feature(dst, RTAX_FEATURE_ECN);
+	ecn_ok_dst = dst_feature(dst, RTAX_FEATURE_MASK_ECN);
+	ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst;
 
-	if ((!ect && ecn_ok) || tcp_ca_needs_ecn(listen_sk))
+	if ((!ect && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
+	    (ecn_ok_dst & RTAX_FEATURE_ECN_CA))
 		inet_rsk(req)->ecn_ok = 1;
 }
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 56b2e71..f1d3de4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1698,6 +1698,7 @@ out:
 static int ip6_convert_metrics(struct mx6_config *mxc,
 			       const struct fib6_config *cfg)
 {
+	bool ecn_ca = false;
 	struct nlattr *nla;
 	int remaining;
 	u32 *mp;
@@ -1722,19 +1723,25 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
 			char tmp[TCP_CA_NAME_MAX];
 
 			nla_strlcpy(tmp, nla, sizeof(tmp));
-			val = tcp_ca_get_key_by_name(tmp);
+			val = tcp_ca_get_key_by_name(tmp, &ecn_ca);
 			if (val == TCP_CA_UNSPEC)
 				goto err;
 		} else {
 			val = nla_get_u32(nla);
 		}
+		if (type == RTAX_FEATURES)
+			val &= ~RTAX_FEATURE_ECN_CA;
 
 		mp[type - 1] = val;
 		__set_bit(type - 1, mxc->mx_valid);
 	}
 
-	mxc->mx = mp;
+	if (ecn_ca) {
+		__set_bit(RTAX_FEATURES - 1, mxc->mx_valid);
+		mp[RTAX_FEATURES - 1] |= RTAX_FEATURE_ECN_CA;
+	}
 
+	mxc->mx = mp;
 	return 0;
  err:
 	kfree(mp);
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH -next 3/3] tcp: use dctcp if enabled on the route to the initiator
  2015-08-28 12:11 ` [PATCH -next 3/3] tcp: use dctcp if enabled on the route to the initiator Florian Westphal
@ 2015-08-28 21:18   ` David Miller
  2015-08-29 20:51     ` Daniel Borkmann
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2015-08-28 21:18 UTC (permalink / raw)
  To: fw; +Cc: netdev, daniel

From: Florian Westphal <fw@strlen.de>
Date: Fri, 28 Aug 2015 14:11:09 +0200

> @@ -418,10 +418,13 @@ enum {
>  
>  #define RTAX_MAX (__RTAX_MAX - 1)
>  
> -#define RTAX_FEATURE_ECN	0x00000001
> -#define RTAX_FEATURE_SACK	0x00000002
> -#define RTAX_FEATURE_TIMESTAMP	0x00000004
> -#define RTAX_FEATURE_ALLFRAG	0x00000008
> +#define RTAX_FEATURE_ECN	(1 << 0)
> +#define RTAX_FEATURE_SACK	(1 << 1)
> +#define RTAX_FEATURE_TIMESTAMP	(1 << 2)
> +#define RTAX_FEATURE_ALLFRAG	(1 << 3)
> +#define RTAX_FEATURE_ECN_CA	(1 << 4)
> +
> +#define RTAX_FEATURE_MASK_ECN	(RTAX_FEATURE_ECN | RTAX_FEATURE_ECN_CA)

I don't know about this.

You're adding a new user visible feature bit, but...

The user can set it and silently the kernel will accept it, but this
bit is ignored.

Regardless of it's internal value, we never publish it to the user.

This is just asking for trouble, and it's a bit sloppy if you ask me.

I think you're going to need to hold this piece of state outside of
the metric value, sorry.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH -next 3/3] tcp: use dctcp if enabled on the route to the initiator
  2015-08-28 21:18   ` David Miller
@ 2015-08-29 20:51     ` Daniel Borkmann
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Borkmann @ 2015-08-29 20:51 UTC (permalink / raw)
  To: David Miller, fw; +Cc: netdev

On 08/28/2015 11:18 PM, David Miller wrote:
...
> I don't know about this.
>
> You're adding a new user visible feature bit, but...
>
> The user can set it and silently the kernel will accept it, but this
> bit is ignored.
>
> Regardless of it's internal value, we never publish it to the user.
>
> This is just asking for trouble, and it's a bit sloppy if you ask me.
>
> I think you're going to need to hold this piece of state outside of
> the metric value, sorry.

Understood, point taken, the main advantage here was to just leverage
the single metric lookup we already have anyway to also derive this
indication at minimal cost. But, we will also revisit some different
possibilities, such as evaluating RTAX_CC_ALGO.

I think it could still be solved the current way, if we change fibs to
return with EINVAL in case someone sets a feature bit that is completely
unknown to the kernel (in other words, one of the remaining unused 28
bits) of that metric value, then it could be solved by hiding all this
entirely in kernel space. Perhaps some part of that bit range might also
be interesting for future kernel-only features to be indicated there in
the dst metric w/o any exposure.

Thanks for your feedback,
Daniel

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-08-29 20:51 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-28 12:11 [PATCH -next 0/3] tcp: receive-side per route dctcp handling Florian Westphal
2015-08-28 12:11 ` [PATCH -next 1/3] net: fib: move metrics parsing to a helper Florian Westphal
2015-08-28 12:11 ` [PATCH -next 2/3] net: fib6: reduce identation in ip6_convert_metrics Florian Westphal
2015-08-28 12:11 ` [PATCH -next 3/3] tcp: use dctcp if enabled on the route to the initiator Florian Westphal
2015-08-28 21:18   ` David Miller
2015-08-29 20:51     ` Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).