Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next] netlink: Warn on unordered or illegal nla_nest_cancel() or nlmsg_cancel()
From: Thomas Graf @ 2015-01-06  0:04 UTC (permalink / raw)
  To: davem; +Cc: netdev

Calling nla_nest_cancel() in a different order as the nesting was
built up can lead to negative offsets being calculated which
results in skb_trim() being called with an underflowed unsigned
int. Warn if mark < skb->data as it's definitely a bug.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/net/netlink.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 6415835..d5869b9 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -520,8 +520,10 @@ static inline void *nlmsg_get_pos(struct sk_buff *skb)
  */
 static inline void nlmsg_trim(struct sk_buff *skb, const void *mark)
 {
-	if (mark)
+	if (mark) {
+		WARN_ON((unsigned char *) mark < skb->data);
 		skb_trim(skb, (unsigned char *) mark - skb->data);
+	}
 }
 
 /**
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH net-next v3 0/5]: ixgbevf: Allow querying VFs RSS indirection table and key
From: Greg Rose @ 2015-01-05 23:54 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: netdev, gleb, avi, jeffrey.t.kirsher
In-Reply-To: <1420467311-6680-1-git-send-email-vladz@cloudius-systems.com>

On Mon, Jan 5, 2015 at 6:15 AM, Vlad Zolotarov
<vladz@cloudius-systems.com> wrote:
> Add the ethtool ops to VF driver to allow querying the RSS indirection table
> and RSS Random Key.
>
>  - PF driver: Add new VF-PF channel commands.
>  - VF driver: Utilize these new commands and add the corresponding
>               ethtool callbacks.
>
> New in v3:
>    - Added a missing support for x550 devices.
>    - Mask the indirection table values according to PSRTYPE[n].RQPL.
>    - Minimized the number of added VF-PF commands.
>
> New in v2:
>    - Added a detailed description to patches 4 and 5.
>
> New in v1 (compared to RFC):
>    - Use "if-else" statement instead of a "switch-case" for a single option case.
>      More specifically: in cases where the newly added API version is the only one
>      allowed. We may consider using a "switch-case" back again when the list of
>      allowed API versions in these specific places grows up.
>
> Vlad Zolotarov (5):
>   ixgbe: Add a RETA query command to VF-PF channel API
>   ixgbevf: Add a RETA query code
>   ixgbe: Add GET_RSS_KEY command to VF-PF channel commands set
>   ixgbevf: Add RSS Key query code
>   ixgbevf: Add the appropriate ethtool ops to query RSS indirection
>     table and key
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h      |  10 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c    |  91 +++++++++++++++
>  drivers/net/ethernet/intel/ixgbevf/ethtool.c      |  43 +++++++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   4 +-
>  drivers/net/ethernet/intel/ixgbevf/mbx.h          |  10 ++
>  drivers/net/ethernet/intel/ixgbevf/vf.c           | 132 ++++++++++++++++++++++
>  drivers/net/ethernet/intel/ixgbevf/vf.h           |   2 +
>  7 files changed, 291 insertions(+), 1 deletion(-)

I've given this code a review and I don't see a way to
set a policy in the PF driver as to whether this request should be
allowed or not.  We cannot enable this query by default - it is a
security risk. To make this acceptable you need to do a
couple of things.

A) Have the query disabled by default such that when a VF driver
requests the RSS info the request is denied.

B) Add hooks to allow system admins to set the policy in the PF driver
as to whether the RSS info requests from the VFs are allowed or
denied.  Only provide the VF the privilege to request the RSS info if
the system admin has explicitly set the policy to allow it.  All other
times the request should be denied.

As it stands this is a non-starter.  Privileged information cannot be
made available to VFs without a way for the system admin to set
policy as to whether the information should be made available or not.

- Greg Rose
Intel Corp
Networking Division
<gregory.v.rose@intel.com>


>
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-05 23:48 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi all :

Please see : http://lxr.free-electrons.com/source/Documentation/networking/ip-sysctl.txt

I am looking at the code and it looks like since the route cache for
ipv4 was removed from the kernel, this sysctl parameter no longer
serves the same purpose. It does not look like it is even used in the
ipv4/route.c module. Is there an equivalent sysctl parameter limiting
the number of route entries in the kernel? Or is there now no
mechanism to limit the number of route entries?

Please someone enlighten me.

Thanks a lot in advance,
ani

^ permalink raw reply

* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: John Fastabend @ 2015-01-05 23:29 UTC (permalink / raw)
  To: Thomas Graf; +Cc: sfeldma, jiri, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <54AADEFF.3090306@gmail.com>

On 01/05/2015 10:59 AM, John Fastabend wrote:

[...]

>>> +#ifndef _UAPI_LINUX_IF_FLOW
>>> +#define _UAPI_LINUX_IF_FLOW
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/netlink.h>
>>> +#include <linux/if.h>
>>> +
>>> +#define NET_FLOW_NAMSIZ 80
>>
>> Did you consider allocating the memory for names? I don't have a grasp
>> for the typical number of net_flow_* instances in memory yet.
>>
>
> <100k in the devices I have. Maybe Simon can pitch in what is typical
> on the NPUs I'm not sure about them.
>
> Rocker tables can grow as large as needed at the moment.
>
> Allocating the memory may help I'll go ahead and give it a try.
>

One issue with breaking this is up is a couple structures are being
passed as attributes with name[] as a field. I think its best to break
these up passing empty arrays seems to be ugly at best. So I'll need to
adjust some of the messaging as well.

-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: r8169 crash also in 3.16.7
From: Francois Romieu @ 2015-01-05 23:00 UTC (permalink / raw)
  To: ael; +Cc: netdev
In-Reply-To: <20150105112044.GA5659@shelf.conquest>

ael <lawrence_a_e@ntlworld.com> :
[...]
> Jan  4 22:09:16 shelf kernel: [10650.233180] WARNING: CPU: 0 PID: 0 at /build/linux-CMiYW9/linux-3.16.7-ckt2/net/sched/sch_generic.c:264 dev_watchdog+0x236/0x240()
[...]
> Jan  4 22:09:16 shelf kernel: [10650.233263] Hardware name: Notebook                         W54_55SU1,SUW/W54_55SU1,SUW, BIOS 4.6.5 05/29/2014

The current kernel contains several changes for your (RTL_GIGA_MAC_VER_44)
8411 r8169 chipset that 3.16.7 doesn't know of. You should give it a try.

-- 
Ueimor

^ permalink raw reply

* [PATCH net-next v3 6/6] net: tcp: add per route congestion control
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

This work adds the possibility to define a per route/destination
congestion control algorithm. Generally, this opens up the possibility
for a machine with different links to enforce specific congestion
control algorithms with optimal strategies for each of them based
on their network characteristics, even transparently for a single
application listening on all links.

For our specific use case, this additionally facilitates deployment
of DCTCP, for example, applications can easily serve internal
traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
would also allow for utilizing e.g. long living, low priority
background flows for certain destinations/routes while still being
able for normal traffic to utilize the default congestion control
algorithm. We also thought about a per netns setting (where different
defaults are possible), but given its actually a link specific
property, we argue that a per route/destination setting is the most
natural and flexible.

The administrator can utilize this through ip-route(8) by appending
"congctl [lock] <name>", where <name> denotes the name of a
congestion control algorithm and the optional lock parameter allows
to enforce the given algorithm so that applications in user space
would not be allowed to overwrite that algorithm for that destination.

The dst metric lookups are being done when a dst entry is already
available in order to avoid a costly lookup and still before the
algorithms are being initialized, thus overhead is very low when the
feature is not being used. While the client side would need to drop
the current reference on the module, on server side this can actually
even be avoided as we just got a flat-copied socket clone.

Joint work with Florian Westphal.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h        |  6 ++++++
 net/ipv4/tcp_ipv4.c      |  2 ++
 net/ipv4/tcp_minisocks.c | 30 ++++++++++++++++++++++++++----
 net/ipv4/tcp_output.c    | 21 +++++++++++++++++++++
 net/ipv6/tcp_ipv6.c      |  2 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 95bb237..b8fdc6b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -448,6 +448,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb);
 struct sock *tcp_create_openreq_child(struct sock *sk,
 				      struct request_sock *req,
 				      struct sk_buff *skb);
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst);
 struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 				  struct request_sock *req,
 				  struct dst_entry *dst);
@@ -636,6 +637,11 @@ static inline u32 tcp_rto_min_us(struct sock *sk)
 	return jiffies_to_usecs(tcp_rto_min(sk));
 }
 
+static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
+{
+	return dst_metric_locked(dst, RTAX_CC_ALGO);
+}
+
 /* Compute the actual receive window we are currently advertising.
  * Rcv_nxt can be after the window if our peer push more data
  * than the offered window.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f72d7..ad3e65b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1340,6 +1340,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 	sk_setup_caps(newsk, dst);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 63d2680..bc9216d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -399,6 +399,32 @@ static void tcp_ecn_openreq_child(struct tcp_sock *tp,
 	tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
 }
 
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+	bool ca_got_dst = false;
+
+	if (ca_key != TCP_CA_UNSPEC) {
+		const struct tcp_congestion_ops *ca;
+
+		rcu_read_lock();
+		ca = tcp_ca_find_key(ca_key);
+		if (likely(ca && try_module_get(ca->owner))) {
+			icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+			icsk->icsk_ca_ops = ca;
+			ca_got_dst = true;
+		}
+		rcu_read_unlock();
+	}
+
+	if (!ca_got_dst && !try_module_get(icsk->icsk_ca_ops->owner))
+		tcp_assign_congestion_control(sk);
+
+	tcp_set_ca_state(sk, TCP_CA_Open);
+}
+EXPORT_SYMBOL_GPL(tcp_ca_openreq_child);
+
 /* This is not only more efficient than what we used to do, it eliminates
  * a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
  *
@@ -451,10 +477,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->snd_cwnd = TCP_INIT_CWND;
 		newtp->snd_cwnd_cnt = 0;
 
-		if (!try_module_get(newicsk->icsk_ca_ops->owner))
-			tcp_assign_congestion_control(newsk);
-
-		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		__skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..dc30cb5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2939,6 +2939,25 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 }
 EXPORT_SYMBOL(tcp_make_synack);
 
+static void tcp_ca_dst_init(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	const struct tcp_congestion_ops *ca;
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+
+	if (ca_key == TCP_CA_UNSPEC)
+		return;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(ca_key);
+	if (likely(ca && try_module_get(ca->owner))) {
+		module_put(icsk->icsk_ca_ops->owner);
+		icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+		icsk->icsk_ca_ops = ca;
+	}
+	rcu_read_unlock();
+}
+
 /* Do all connect socket setups that can be done AF independent. */
 static void tcp_connect_init(struct sock *sk)
 {
@@ -2964,6 +2983,8 @@ static void tcp_connect_init(struct sock *sk)
 	tcp_mtup_init(sk);
 	tcp_sync_mss(sk, dst_mtu(dst));
 
+	tcp_ca_dst_init(sk, dst);
+
 	if (!tp->window_clamp)
 		tp->window_clamp = dst_metric(dst, RTAX_WINDOW);
 	tp->advmss = dst_metric_advmss(dst);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 9c0b54e..5d46832 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1199,6 +1199,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		inet_csk(newsk)->icsk_ext_hdr_len = (newnp->opt->opt_nflen +
 						     newnp->opt->opt_flen);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 1/6] net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev, Michal Kubeček
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

When IPv6 host routes with metrics attached are being added, we fetch
the metrics store from the dst via COW through dst_metrics_write_ptr(),
added through commit e5fd387ad5b3.

One remaining problem here is that we actually call into inet_getpeer()
and may end up allocating/creating a new peer from the kmemcache, which
may fail.

Example trace from perf probe (inet_getpeer:41) where create is 1:

ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
  85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
  85e578 inet_getpeer
  8eb333 ipv6_cow_metrics
  8f10ff fib6_commit_metrics

Therefore, a check for NULL on the return of dst_metrics_write_ptr()
is necessary here.

Joint work with Florian Westphal.

Fixes: e5fd387ad5b3 ("ipv6: do not overwrite inetpeer metrics prematurely")
Cc: Michal Kubeček <mkubecek@suse.cz>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/ipv6/ip6_fib.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index b2d1838..db4984e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -633,18 +633,17 @@ static bool rt6_qualify_for_ecmp(struct rt6_info *rt)
 static int fib6_commit_metrics(struct dst_entry *dst,
 			       struct nlattr *mx, int mx_len)
 {
+	bool dst_host = dst->flags & DST_HOST;
 	struct nlattr *nla;
 	int remaining;
 	u32 *mp;
 
-	if (dst->flags & DST_HOST) {
-		mp = dst_metrics_write_ptr(dst);
-	} else {
-		mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
-		if (!mp)
-			return -ENOMEM;
+	mp = dst_host ? dst_metrics_write_ptr(dst) :
+			kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
+	if (unlikely(!mp))
+		return -ENOMEM;
+	if (!dst_host)
 		dst_init_metrics(dst, mp, 0);
-	}
 
 	nla_for_each_attr(nla, mx, mx_len, remaining) {
 		int type = nla_type(nla);
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 5/6] net: tcp: add RTAX_CC_ALGO fib handling
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.

While the internal representation of RTAX_CC_ALGO is handled as a u32
key, we avoided to expose this implementation detail to user space, thus
instead, we chose the netlink attribute that is being exchanged between
user space to be the actual congestion control algorithm name, similarly
as in the setsockopt(2) API in order to allow for maximum flexibility,
even for 3rd party modules.

It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
it should have been stored in RTAX_FEATURES instead, we first thought
about reusing it for the congestion control key, but it brings more
complications and/or confusion than worth it.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h              |  7 +++++++
 include/uapi/linux/rtnetlink.h |  2 ++
 net/core/rtnetlink.c           | 15 +++++++++++++--
 net/decnet/dn_fib.c            |  3 ++-
 net/decnet/dn_table.c          |  4 +++-
 net/ipv4/fib_semantics.c       | 14 ++++++++++++--
 net/ipv6/route.c               | 17 +++++++++++++++--
 7 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 135b70c..95bb237 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -846,7 +846,14 @@ extern struct tcp_congestion_ops tcp_reno;
 
 struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
 u32 tcp_ca_get_key_by_name(const char *name);
+#ifdef CONFIG_INET
 char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+#else
+static inline char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	return NULL;
+}
+#endif
 
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 9c9b8b4..d81f22d 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -389,6 +389,8 @@ enum {
 #define RTAX_INITRWND RTAX_INITRWND
 	RTAX_QUICKACK,
 #define RTAX_QUICKACK RTAX_QUICKACK
+	RTAX_CC_ALGO,
+#define RTAX_CC_ALGO RTAX_CC_ALGO
 	__RTAX_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 9cf6fe9..001372f 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -50,6 +50,7 @@
 #include <net/arp.h>
 #include <net/route.h>
 #include <net/udp.h>
+#include <net/tcp.h>
 #include <net/sock.h>
 #include <net/pkt_sched.h>
 #include <net/fib_rules.h>
@@ -669,9 +670,19 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)
 
 	for (i = 0; i < RTAX_MAX; i++) {
 		if (metrics[i]) {
+			if (i == RTAX_CC_ALGO - 1) {
+				char tmp[TCP_CA_NAME_MAX], *name;
+
+				name = tcp_ca_get_name_by_key(metrics[i], tmp);
+				if (!name)
+					continue;
+				if (nla_put_string(skb, i + 1, name))
+					goto nla_put_failure;
+			} else {
+				if (nla_put_u32(skb, i + 1, metrics[i]))
+					goto nla_put_failure;
+			}
 			valid++;
-			if (nla_put_u32(skb, i+1, metrics[i]))
-				goto nla_put_failure;
 		}
 	}
 
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index d332aef..df48034 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -298,7 +298,8 @@ struct dn_fib_info *dn_fib_create_info(const struct rtmsg *r, struct nlattr *att
 			int type = nla_type(attr);
 
 			if (type) {
-				if (type > RTAX_MAX || nla_len(attr) < 4)
+				if (type > RTAX_MAX || type == RTAX_CC_ALGO ||
+				    nla_len(attr) < 4)
 					goto err_inval;
 
 				fi->fib_metrics[type-1] = nla_get_u32(attr);
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index 86e3807..3f19fcb 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -29,6 +29,7 @@
 #include <linux/route.h> /* RTF_xxx */
 #include <net/neighbour.h>
 #include <net/netlink.h>
+#include <net/tcp.h>
 #include <net/dst.h>
 #include <net/flow.h>
 #include <net/fib_rules.h>
@@ -273,7 +274,8 @@ static inline size_t dn_fib_nlmsg_size(struct dn_fib_info *fi)
 	size_t payload = NLMSG_ALIGN(sizeof(struct rtmsg))
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(2) /* RTA_DST */
-			 + nla_total_size(4); /* RTA_PRIORITY */
+			 + nla_total_size(4) /* RTA_PRIORITY */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f99f41b..d2b7b55 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -360,7 +360,8 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(4) /* RTA_DST */
 			 + nla_total_size(4) /* RTA_PRIORITY */
-			 + nla_total_size(4); /* RTA_PREFSRC */
+			 + nla_total_size(4) /* RTA_PREFSRC */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
@@ -859,7 +860,16 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 
 				if (type > RTAX_MAX)
 					goto err_inval;
-				val = nla_get_u32(nla);
+				if (type == RTAX_CC_ALGO) {
+					char tmp[TCP_CA_NAME_MAX];
+
+					nla_strlcpy(tmp, nla, sizeof(tmp));
+					val = tcp_ca_get_key_by_name(tmp);
+					if (val == TCP_CA_UNSPEC)
+						goto err_inval;
+				} else {
+					val = nla_get_u32(nla);
+				}
 				if (type == RTAX_ADVMSS && val > 65535 - 40)
 					val = 65535 - 40;
 				if (type == RTAX_MTU && val > 65535 - 15)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 454771d..34dcbb5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1488,10 +1488,22 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
 		int type = nla_type(nla);
 
 		if (type) {
+			u32 val;
+
 			if (unlikely(type > RTAX_MAX))
 				goto err;
+			if (type == RTAX_CC_ALGO) {
+				char tmp[TCP_CA_NAME_MAX];
+
+				nla_strlcpy(tmp, nla, sizeof(tmp));
+				val = tcp_ca_get_key_by_name(tmp);
+				if (val == TCP_CA_UNSPEC)
+					goto err;
+			} else {
+				val = nla_get_u32(nla);
+			}
 
-			mp[type - 1] = nla_get_u32(nla);
+			mp[type - 1] = val;
 			__set_bit(type - 1, mxc->mx_valid);
 		}
 	}
@@ -2571,7 +2583,8 @@ static inline size_t rt6_nlmsg_size(void)
 	       + nla_total_size(4) /* RTA_OIF */
 	       + nla_total_size(4) /* RTA_PRIORITY */
 	       + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */
-	       + nla_total_size(sizeof(struct rta_cacheinfo));
+	       + nla_total_size(sizeof(struct rta_cacheinfo))
+	       + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 }
 
 static int rt6_fill_node(struct net *net,
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 0/6] net: allow setting congctl via routing table
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

This is the second part of our work and allows for setting the congestion
control algorithm via routing table. For details, please see individual
patches.

Since patch 1 is a bug fix, we suggest applying patch 1 to net, and then
merging net into net-next, for example, and following up with the remaining
feature patches wrt dependencies.

Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.

Patch for iproute2 is available under [1], but will be reposted with along
with the man-page update when this set hits net-next.

  [1] http://patchwork.ozlabs.org/patch/418149/

Thanks!

v2 -> v3:
 - Added module auto-loading as suggested by David Miller, thanks!
  - Added patch 2 for handling possible sleeps in fib6
  - While working on this, we discovered a bug, hence fix in patch 1
  - Added auto-loading to patch 4
 - Rebased, retested, rest the same.
v1 -> v2:
 - Very sorry, I noticed I had decnet disabled during testing.
   Added missing header include in decnet, rest as is.

Daniel Borkmann (5):
  net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference
  net: tcp: refactor reinitialization of congestion control
  net: tcp: add key management to congestion control
  net: tcp: add RTAX_CC_ALGO fib handling
  net: tcp: add per route congestion control

Florian Westphal (1):
  net: fib6: convert cfg metric to u32 outside of table write lock

 include/net/inet_connection_sock.h |   3 +-
 include/net/ip6_fib.h              |  10 ++-
 include/net/tcp.h                  |  22 ++++++-
 include/uapi/linux/rtnetlink.h     |   2 +
 net/core/rtnetlink.c               |  15 ++++-
 net/decnet/dn_fib.c                |   3 +-
 net/decnet/dn_table.c              |   4 +-
 net/ipv4/fib_semantics.c           |  14 ++++-
 net/ipv4/tcp_cong.c                | 121 +++++++++++++++++++++++++++++--------
 net/ipv4/tcp_ipv4.c                |   2 +
 net/ipv4/tcp_minisocks.c           |  30 +++++++--
 net/ipv4/tcp_output.c              |  21 +++++++
 net/ipv6/ip6_fib.c                 |  68 +++++++++++----------
 net/ipv6/route.c                   |  72 ++++++++++++++++++----
 net/ipv6/tcp_ipv6.c                |   2 +
 15 files changed, 304 insertions(+), 85 deletions(-)

-- 
1.7.11.7

^ permalink raw reply

* [PATCH net-next v3 4/6] net: tcp: add key management to congestion control
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

This patch adds necessary infrastructure to the congestion control
framework for later per route congestion control support.

For a per route congestion control possibility, our aim is to store
a unique u32 key identifier into dst metrics, which can then be
mapped into a tcp_congestion_ops struct. We argue that having a
RTAX key entry is the most simple, generic and easy way to manage,
and also keeps the memory footprint of dst entries lower on 64 bit
than with storing a pointer directly, for example. Having a unique
key id also allows for decoupling actual TCP congestion control
module management from the FIB layer, i.e. we don't have to care
about expensive module refcounting inside the FIB at this point.

We first thought of using an IDR store for the realization, which
takes over dynamic assignment of unused key space and also performs
the key to pointer mapping in RCU. While doing so, we stumbled upon
the issue that due to the nature of dynamic key distribution, it
just so happens, arguably in very rare occasions, that excessive
module loads and unloads can lead to a possible reuse of previously
used key space. Thus, previously stale keys in the dst metric are
now being reassigned to a different congestion control algorithm,
which might lead to unexpected behaviour. One way to resolve this
would have been to walk FIBs on the actually rare occasion of a
module unload and reset the metric keys for each FIB in each netns,
but that's just very costly.

Therefore, we argue a better solution is to reuse the unique
congestion control algorithm name member and map that into u32 key
space through jhash. For that, we split the flags attribute (as it
currently uses 2 bits only anyway) into two u32 attributes, flags
and key, so that we can keep the cacheline boundary of 2 cachelines
on x86_64 and cache the precalculated key at registration time for
the fast path. On average we might expect 2 - 4 modules being loaded
worst case perhaps 15, so a key collision possibility is extremely
low, and guaranteed collision-free on LE/BE for all in-tree modules.
Overall this results in much simpler code, and all without the
overhead of an IDR. Due to the deterministic nature, modules can
now be unloaded, the congestion control algorithm for a specific
but unloaded key will fall back to the default one, and on module
reload time it will switch back to the expected algorithm
transparently.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/inet_connection_sock.h |  3 +-
 include/net/tcp.h                  |  9 +++-
 net/ipv4/tcp_cong.c                | 97 +++++++++++++++++++++++++++++++-------
 3 files changed, 91 insertions(+), 18 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 848e85c..5976bde 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -98,7 +98,8 @@ struct inet_connection_sock {
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
-	__u8			  icsk_ca_state;
+	__u8			  icsk_ca_state:7,
+				  icsk_ca_dst_locked:1;
 	__u8			  icsk_retransmits;
 	__u8			  icsk_pending;
 	__u8			  icsk_backoff;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..135b70c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,6 +787,8 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CA_MAX	128
 #define TCP_CA_BUF_MAX	(TCP_CA_NAME_MAX*TCP_CA_MAX)
 
+#define TCP_CA_UNSPEC	0
+
 /* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
@@ -794,7 +796,8 @@ enum tcp_ca_ack_event_flags {
 
 struct tcp_congestion_ops {
 	struct list_head	list;
-	unsigned long flags;
+	u32 key;
+	u32 flags;
 
 	/* initialize private data (optional) */
 	void (*init)(struct sock *sk);
@@ -841,6 +844,10 @@ u32 tcp_reno_ssthresh(struct sock *sk);
 void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
 extern struct tcp_congestion_ops tcp_reno;
 
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
+u32 tcp_ca_get_key_by_name(const char *name);
+char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 38f2f8a..63c29db 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -13,6 +13,7 @@
 #include <linux/types.h>
 #include <linux/list.h>
 #include <linux/gfp.h>
+#include <linux/jhash.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -31,6 +32,34 @@ static struct tcp_congestion_ops *tcp_ca_find(const char *name)
 	return NULL;
 }
 
+/* Must be called with rcu lock held */
+static const struct tcp_congestion_ops *__tcp_ca_find_autoload(const char *name)
+{
+	const struct tcp_congestion_ops *ca = tcp_ca_find(name);
+#ifdef CONFIG_MODULES
+	if (!ca && capable(CAP_NET_ADMIN)) {
+		rcu_read_unlock();
+		request_module("tcp_%s", name);
+		rcu_read_lock();
+		ca = tcp_ca_find(name);
+	}
+#endif
+	return ca;
+}
+
+/* Simple linear search, not much in here. */
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
+{
+	struct tcp_congestion_ops *e;
+
+	list_for_each_entry_rcu(e, &tcp_cong_list, list) {
+		if (e->key == key)
+			return e;
+	}
+
+	return NULL;
+}
+
 /*
  * Attach new congestion control algorithm to the list
  * of available options.
@@ -45,9 +74,12 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 		return -EINVAL;
 	}
 
+	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
+
 	spin_lock(&tcp_cong_list_lock);
-	if (tcp_ca_find(ca->name)) {
-		pr_notice("%s already registered\n", ca->name);
+	if (ca->key == TCP_CA_UNSPEC || tcp_ca_find_key(ca->key)) {
+		pr_notice("%s already registered or non-unique key\n",
+			  ca->name);
 		ret = -EEXIST;
 	} else {
 		list_add_tail_rcu(&ca->list, &tcp_cong_list);
@@ -70,9 +102,50 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
 	spin_lock(&tcp_cong_list_lock);
 	list_del_rcu(&ca->list);
 	spin_unlock(&tcp_cong_list_lock);
+
+	/* Wait for outstanding readers to complete before the
+	 * module gets removed entirely.
+	 *
+	 * A try_module_get() should fail by now as our module is
+	 * in "going" state since no refs are held anymore and
+	 * module_exit() handler being called.
+	 */
+	synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
 
+u32 tcp_ca_get_key_by_name(const char *name)
+{
+	const struct tcp_congestion_ops *ca;
+	u32 key;
+
+	might_sleep();
+
+	rcu_read_lock();
+	ca = __tcp_ca_find_autoload(name);
+	key = ca ? ca->key : TCP_CA_UNSPEC;
+	rcu_read_unlock();
+
+	return key;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_key_by_name);
+
+char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	const struct tcp_congestion_ops *ca;
+	char *ret = NULL;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(key);
+	if (ca)
+		ret = strncpy(buffer, ca->name,
+			      TCP_CA_NAME_MAX);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_name_by_key);
+
 /* Assign choice of congestion control. */
 void tcp_assign_congestion_control(struct sock *sk)
 {
@@ -253,25 +326,17 @@ out:
 int tcp_set_congestion_control(struct sock *sk, const char *name)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
-	struct tcp_congestion_ops *ca;
+	const struct tcp_congestion_ops *ca;
 	int err = 0;
 
-	rcu_read_lock();
-	ca = tcp_ca_find(name);
+	if (icsk->icsk_ca_dst_locked)
+		return -EPERM;
 
-	/* no change asking for existing value */
+	rcu_read_lock();
+	ca = __tcp_ca_find_autoload(name);
+	/* No change asking for existing value */
 	if (ca == icsk->icsk_ca_ops)
 		goto out;
-
-#ifdef CONFIG_MODULES
-	/* not found attempt to autoload module */
-	if (!ca && capable(CAP_NET_ADMIN)) {
-		rcu_read_unlock();
-		request_module("tcp_%s", name);
-		rcu_read_lock();
-		ca = tcp_ca_find(name);
-	}
-#endif
 	if (!ca)
 		err = -ENOENT;
 	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 2/6] net: fib6: convert cfg metric to u32 outside of table write lock
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

From: Florian Westphal <fw@strlen.de>

Do the nla validation earlier, outside the write lock.

This is needed by followup patch which needs to be able to call
request_module (which can sleep) if needed.

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/ip6_fib.h | 10 +++++---
 net/ipv6/ip6_fib.c    | 69 +++++++++++++++++++++++++++------------------------
 net/ipv6/route.c      | 57 ++++++++++++++++++++++++++++++++++--------
 3 files changed, 90 insertions(+), 46 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 8eea35d..20e80fa 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -74,6 +74,11 @@ struct fib6_node {
 #define FIB6_SUBTREE(fn)	((fn)->subtree)
 #endif
 
+struct mx6_config {
+	const u32 *mx;
+	DECLARE_BITMAP(mx_valid, RTAX_MAX);
+};
+
 /*
  *	routing information
  *
@@ -291,9 +296,8 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
 		    void *arg);
 
-int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
-	     struct nlattr *mx, int mx_len);
-
+int fib6_add(struct fib6_node *root, struct rt6_info *rt,
+	     struct nl_info *info, struct mx6_config *mxc);
 int fib6_del(struct rt6_info *rt, struct nl_info *info);
 
 void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index db4984e..03c520a 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -630,31 +630,35 @@ static bool rt6_qualify_for_ecmp(struct rt6_info *rt)
 	       RTF_GATEWAY;
 }
 
-static int fib6_commit_metrics(struct dst_entry *dst,
-			       struct nlattr *mx, int mx_len)
+static void fib6_copy_metrics(u32 *mp, const struct mx6_config *mxc)
 {
-	bool dst_host = dst->flags & DST_HOST;
-	struct nlattr *nla;
-	int remaining;
-	u32 *mp;
+	int i;
 
-	mp = dst_host ? dst_metrics_write_ptr(dst) :
-			kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
-	if (unlikely(!mp))
-		return -ENOMEM;
-	if (!dst_host)
-		dst_init_metrics(dst, mp, 0);
+	for (i = 0; i < RTAX_MAX; i++) {
+		if (test_bit(i, mxc->mx_valid))
+			mp[i] = mxc->mx[i];
+	}
+}
+
+static int fib6_commit_metrics(struct dst_entry *dst, struct mx6_config *mxc)
+{
+	if (!mxc->mx)
+		return 0;
 
-	nla_for_each_attr(nla, mx, mx_len, remaining) {
-		int type = nla_type(nla);
+	if (dst->flags & DST_HOST) {
+		u32 *mp = dst_metrics_write_ptr(dst);
 
-		if (type) {
-			if (type > RTAX_MAX)
-				return -EINVAL;
+		if (unlikely(!mp))
+			return -ENOMEM;
 
-			mp[type - 1] = nla_get_u32(nla);
-		}
+		fib6_copy_metrics(mp, mxc);
+	} else {
+		dst_init_metrics(dst, mxc->mx, false);
+
+		/* We've stolen mx now. */
+		mxc->mx = NULL;
 	}
+
 	return 0;
 }
 
@@ -663,7 +667,7 @@ static int fib6_commit_metrics(struct dst_entry *dst,
  */
 
 static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
-			    struct nl_info *info, struct nlattr *mx, int mx_len)
+			    struct nl_info *info, struct mx6_config *mxc)
 {
 	struct rt6_info *iter = NULL;
 	struct rt6_info **ins;
@@ -772,11 +776,10 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			pr_warn("NLM_F_CREATE should be set when creating new route\n");
 
 add:
-		if (mx) {
-			err = fib6_commit_metrics(&rt->dst, mx, mx_len);
-			if (err)
-				return err;
-		}
+		err = fib6_commit_metrics(&rt->dst, mxc);
+		if (err)
+			return err;
+
 		rt->dst.rt6_next = iter;
 		*ins = rt;
 		rt->rt6i_node = fn;
@@ -796,11 +799,11 @@ add:
 			pr_warn("NLM_F_REPLACE set, but no existing node found!\n");
 			return -ENOENT;
 		}
-		if (mx) {
-			err = fib6_commit_metrics(&rt->dst, mx, mx_len);
-			if (err)
-				return err;
-		}
+
+		err = fib6_commit_metrics(&rt->dst, mxc);
+		if (err)
+			return err;
+
 		*ins = rt;
 		rt->rt6i_node = fn;
 		rt->dst.rt6_next = iter->dst.rt6_next;
@@ -837,8 +840,8 @@ void fib6_force_start_gc(struct net *net)
  *	with source addr info in sub-trees
  */
 
-int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
-	     struct nlattr *mx, int mx_len)
+int fib6_add(struct fib6_node *root, struct rt6_info *rt,
+	     struct nl_info *info, struct mx6_config *mxc)
 {
 	struct fib6_node *fn, *pn = NULL;
 	int err = -ENOMEM;
@@ -933,7 +936,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 	}
 #endif
 
-	err = fib6_add_rt2node(fn, rt, info, mx, mx_len);
+	err = fib6_add_rt2node(fn, rt, info, mxc);
 	if (!err) {
 		fib6_start_gc(info->nl_net, rt);
 		if (!(rt->rt6i_flags & RTF_CACHE))
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c910831..454771d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -853,14 +853,14 @@ EXPORT_SYMBOL(rt6_lookup);
  */
 
 static int __ip6_ins_rt(struct rt6_info *rt, struct nl_info *info,
-			struct nlattr *mx, int mx_len)
+			struct mx6_config *mxc)
 {
 	int err;
 	struct fib6_table *table;
 
 	table = rt->rt6i_table;
 	write_lock_bh(&table->tb6_lock);
-	err = fib6_add(&table->tb6_root, rt, info, mx, mx_len);
+	err = fib6_add(&table->tb6_root, rt, info, mxc);
 	write_unlock_bh(&table->tb6_lock);
 
 	return err;
@@ -868,10 +868,10 @@ static int __ip6_ins_rt(struct rt6_info *rt, struct nl_info *info,
 
 int ip6_ins_rt(struct rt6_info *rt)
 {
-	struct nl_info info = {
-		.nl_net = dev_net(rt->dst.dev),
-	};
-	return __ip6_ins_rt(rt, &info, NULL, 0);
+	struct nl_info info = {	.nl_net = dev_net(rt->dst.dev), };
+	struct mx6_config mxc = { .mx = NULL, };
+
+	return __ip6_ins_rt(rt, &info, &mxc);
 }
 
 static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
@@ -1470,9 +1470,39 @@ out:
 	return entries > rt_max_size;
 }
 
-/*
- *
- */
+static int ip6_convert_metrics(struct mx6_config *mxc,
+			       const struct fib6_config *cfg)
+{
+	struct nlattr *nla;
+	int remaining;
+	u32 *mp;
+
+	if (cfg->fc_mx == NULL)
+		return 0;
+
+	mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
+	if (unlikely(!mp))
+		return -ENOMEM;
+
+	nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) {
+		int type = nla_type(nla);
+
+		if (type) {
+			if (unlikely(type > RTAX_MAX))
+				goto err;
+
+			mp[type - 1] = nla_get_u32(nla);
+			__set_bit(type - 1, mxc->mx_valid);
+		}
+	}
+
+	mxc->mx = mp;
+
+	return 0;
+ err:
+	kfree(mp);
+	return -EINVAL;
+}
 
 int ip6_route_add(struct fib6_config *cfg)
 {
@@ -1482,6 +1512,7 @@ int ip6_route_add(struct fib6_config *cfg)
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
 	struct fib6_table *table;
+	struct mx6_config mxc = { .mx = NULL, };
 	int addr_type;
 
 	if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
@@ -1677,8 +1708,14 @@ install_route:
 
 	cfg->fc_nlinfo.nl_net = dev_net(dev);
 
-	return __ip6_ins_rt(rt, &cfg->fc_nlinfo, cfg->fc_mx, cfg->fc_mx_len);
+	err = ip6_convert_metrics(&mxc, cfg);
+	if (err)
+		goto out;
+
+	err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc);
 
+	kfree(mxc.mx);
+	return err;
 out:
 	if (dev)
 		dev_put(dev);
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 3/6] net: tcp: refactor reinitialization of congestion control
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

We can just move this to an extra function and make the code
a bit more readable, no functional change.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/ipv4/tcp_cong.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..38f2f8a 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -107,6 +107,18 @@ void tcp_init_congestion_control(struct sock *sk)
 		icsk->icsk_ca_ops->init(sk);
 }
 
+static void tcp_reinit_congestion_control(struct sock *sk,
+					  const struct tcp_congestion_ops *ca)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	tcp_cleanup_congestion_control(sk);
+	icsk->icsk_ca_ops = ca;
+
+	if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
+		icsk->icsk_ca_ops->init(sk);
+}
+
 /* Manage refcounts on socket close. */
 void tcp_cleanup_congestion_control(struct sock *sk)
 {
@@ -262,21 +274,13 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 #endif
 	if (!ca)
 		err = -ENOENT;
-
 	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
 		   ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
 		err = -EPERM;
-
 	else if (!try_module_get(ca->owner))
 		err = -EBUSY;
-
-	else {
-		tcp_cleanup_congestion_control(sk);
-		icsk->icsk_ca_ops = ca;
-
-		if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
-			icsk->icsk_ca_ops->init(sk);
-	}
+	else
+		tcp_reinit_congestion_control(sk, ca);
  out:
 	rcu_read_unlock();
 	return err;
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH] Revert "ipw2200: select CFG80211_WEXT"
From: Arend van Spriel @ 2015-01-05 22:13 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Johannes Berg, Linus Torvalds, Marcel Holtmann,
	Stanislav Yakovlev, Kalle Valo, Jiri Kosina, linux-wireless,
	Network Development, Linux Kernel Mailing List
In-Reply-To: <1420495519.14308.29.camel@x220>

On 01/05/15 23:05, Paul Bolle wrote:
> On Mon, 2015-01-05 at 19:57 +0100, Johannes Berg wrote:
>> Multiple other groups of ioctls could be converted in similar patches,
>> until at the end you can completely remove ipw_wx_handlers and rely
>> entirely on cfg80211's wext compatibility.
>>
>> So far the theory - in practice nobody cared enough to start working on
>> any of these drivers, let alone actually has the hardware today.
>
> So my suggestion to make ipw2200 no longer use cfg80211_wext_giwname()
> would actually be backwards. What's actually needed, in theory, is to
> use more of what's provided under CFG80211_WEXT (and, I guess, less of
> what's provided under WIRELESS_EXT). Did I get that right?

Yes, but as Johannes indicated it needs consideration what to group in 
the patches.

Regards,
Arend

> Thanks,
>
>
> Paul Bolle
>

^ permalink raw reply

* Re: [PATCH/RFC rocker-net-next 1/6] net: flow: Cancel innermost nested attribute first
From: David Miller @ 2015-01-05 22:10 UTC (permalink / raw)
  To: tgraf; +Cc: simon.horman, john.fastabend, netdev
In-Reply-To: <20150105220113.GD31637@casper.infradead.org>

From: Thomas Graf <tgraf@suug.ch>
Date: Mon, 5 Jan 2015 22:01:13 +0000

> Thinking about this I'll send a patch to WARN() in nlmsg_trim() on
> mark < skb->data.

+1

^ permalink raw reply

* Re: [PATCH] Revert "ipw2200: select CFG80211_WEXT"
From: Paul Bolle @ 2015-01-05 22:05 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Arend van Spriel, Linus Torvalds, Marcel Holtmann,
	Stanislav Yakovlev, Kalle Valo, Jiri Kosina, linux-wireless,
	Network Development, Linux Kernel Mailing List
In-Reply-To: <1420484224.9459.16.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

On Mon, 2015-01-05 at 19:57 +0100, Johannes Berg wrote:
> Multiple other groups of ioctls could be converted in similar patches,
> until at the end you can completely remove ipw_wx_handlers and rely
> entirely on cfg80211's wext compatibility.
> 
> So far the theory - in practice nobody cared enough to start working on
> any of these drivers, let alone actually has the hardware today.

So my suggestion to make ipw2200 no longer use cfg80211_wext_giwname()
would actually be backwards. What's actually needed, in theory, is to
use more of what's provided under CFG80211_WEXT (and, I guess, less of
what's provided under WIRELESS_EXT). Did I get that right?

Thanks,


Paul Bolle

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH/RFC rocker-net-next 1/6] net: flow: Cancel innermost nested attribute first
From: Thomas Graf @ 2015-01-05 22:01 UTC (permalink / raw)
  To: David Miller; +Cc: simon.horman, john.fastabend, netdev
In-Reply-To: <20150105.161725.1765207203472571760.davem@davemloft.net>

On 01/05/15 at 04:17pm, David Miller wrote:
> From: Simon Horman <simon.horman@netronome.com>
> Date: Mon,  5 Jan 2015 15:50:05 +0900
> 
> > Cancel innermost nested attribute first on error when putting flow actions.
> > 
> > Signed-off-by: Simon Horman <simon.horman@netronome.com>
> > 
> > ---
> > 
> > Its unclear to me if this makes any difference.
> > But it seems more logical to me.
> 
> Hmmm.  Be careful here.  nla_nest_cancel() is just rolling back
> the length of the SKB to right before the netlink attribute being
> given as the cancellation point.
> 
> So you really have to cancel attributes in exactly the reverse order
> in which they were added.  Otherwise we'll make a trim call with a
> negative adjustment that actually expands the SKB past an already
> cancelled attribute.

As I told John in the other the other thread: It is often clearer
and less error prone to only ever cancel the outer most nesting
level if we are undoing all of them anyway. Dave is absolutely right,
a wrong order will lead to bugs which are hard to track down.
Thinking about this I'll send a patch to WARN() in nlmsg_trim() on
mark < skb->data.

^ permalink raw reply

* Re: WIP alternative - was Re: [PATCH v3 14/20] selftests/size: add install target to enable test install
From: Tim Bird @ 2015-01-05 21:56 UTC (permalink / raw)
  To: Shuah Khan, mmarek-AlSwsSmVLrQ@public.gmane.org,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org,
	mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org,
	tranmanphong-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org,
	cov-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	bobby.prani-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org,
	josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org,
	koct9i-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
  Cc: linux-kbuild-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <54AB0215.5020409-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

On 01/05/2015 01:28 PM, Shuah Khan wrote:
> On 12/31/2014 07:31 PM, Tim Bird wrote:
...
>> The install phase is desperately needed for usage of kselftest in
>> cross-target situations (applicable to almost all embedded).  So this
>> is great stuff.
> 
> Thanks.
> 
>>
>> I worked a bit on isolating the install stuff to a makefile include file.
>> This allows simplifying some of the sub-level Makefiles a bit, and allowing
>> control of some of the install and run logic in less places.
>>
>> This is still a work in progress, but before I got too far along, I wanted
>> to post it for people to provide feedback.  A couple of problems cropped
>> up that are worth discussing, IMHO.
>>
>> 1) I think it should be a requirement that each test has a single
>> "main" program to execute to run the tests.  If multiple tests are supported
>> or more flexibility is desired for additional arguments, or that sort of
>> thing, then that's fine, but the automated script builder should be able
>> to run just a single program or script to have things work.  This also
>> makes things more consistent.  In the case of the firmware test, I created
>> a single fw_both.sh script to do this, instead of having two separate
>> blocks in the kselftest.sh script.
> 
> It is a good goal for individual tests to use a main program to run
> tests, even though, I would not make it a requirement. I would like to
> leave that decision up to the individual test writer.
>
OK.  It helps to have a single line when trying to isolate
RUN_TEST creation into the include file, but there may be other
ways to accomplish this.

>>
>> 2) I've added a CROSS_INSTALL variable, which can call an arbitrary program
>> to place files on the target system (rather than just calling 'install').
>> In my case, I'd use my own 'ttc cp' command, which I can extend as necessary
>> to put things on a remote machine.  This works for a single directory,
>> but things get dicier with sub-directory trees full of files (like
>> the ftrace test uses.)
>>
>> If additional items need to be installed to the target, then maybe a setup
>> program should be used, rather than just copying files.
>>
>> 3) Some of the scripts were using /bin/bash to execute them, rather
>> than rely on the interpreter line in the script itself (and having
>> the script have executable privileges).  Is there a reason for this?
>> I modified a few scripts to be executable, and got rid of the
>> explicit execution with /bin/bash.
> 
> Probably no reason other than the choice made by the test writer.
> It could be cleaned up and made consistent, however, I would see
> this as a separate enhancement type work that could be done on its
> own and not include it in the install work.

OK - this was also something that simplified the creation
of the RUN_TEST variable in the isolated include file.
Also, having the interpreter explicit in the invocation line
in the Makefile as well as in the script itself is a bit redundant.
>>
>> The following is just a start...  Let me know if this direction looks
>> OK, and I'll finish this up.  The main item to look at is
>> kselftest.include file.  Note that these patches are based on Shuah's
>> series - but if you want to use these ideas I can rebase them onto
>> mainline, and break them out per test sub-level like Shuah did.
> 
> One of the reasons I picked install target approach is to enable the
> feature by extending the existing run_tests support. This way we will
> have the feature available quickly. Once that is supported, we can work
> on evolving to a generic approach to use the include file approach, as
> the changes you are proposing are based on the the series I sent out,
> and makes improvements to it.
> 
> kselftest.include file approach could work for several tests and tests
> that can't use the generic could add their own install support.
> 
> I propose evolving to a generic kselftest.include as the second step in
> evolving the install feature. Can I count on you do the work and update
> the tests to use kselftest.include, CROSS_INSTALL variable support?

Yes.  I'd be happy to evolve it through phases to support the include
file and cross-target install feature.

Is there anything I can help with in the mean time?  Some of the tests
require a directory tree of files rather than just a few top-level files
(e.g. ftrace).

I was thinking about doing some work to tar-ify the needed directories of
data files during build, and untar it in the execution area during the
install step.  Do you want me to propose something there?
 -- Tim

^ permalink raw reply

* [PATCH net-next v3 4/4] ip: Add offset parameter to ip_cmsg_recv
From: Tom Herbert @ 2015-01-05 21:56 UTC (permalink / raw)
  To: davem, netdev
In-Reply-To: <1420494977-15026-1-git-send-email-therbert@google.com>

Add ip_cmsg_recv_offset function which takes an offset argument
that indicates the starting offset in skb where data is being received
from. This will be useful in the case of UDP and provided checksum
to user space.

ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
zero.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/net/inet_sock.h |  1 +
 include/uapi/linux/in.h |  1 +
 net/ipv4/ip_sockglue.c  | 41 ++++++++++++++++++++++++++++++++++++++++-
 net/ipv4/udp.c          |  2 +-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 605ca42..eb16c7b 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -203,6 +203,7 @@ struct inet_sock {
 #define IP_CMSG_RETOPTS		BIT(4)
 #define IP_CMSG_PASSSEC		BIT(5)
 #define IP_CMSG_ORIGDSTADDR	BIT(6)
+#define IP_CMSG_CHECKSUM	BIT(7)
 
 static inline struct inet_sock *inet_sk(const struct sock *sk)
 {
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index c33a65e..589ced0 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -109,6 +109,7 @@ struct in_addr {
 
 #define IP_MINTTL       21
 #define IP_NODEFRAG     22
+#define IP_CHECKSUM	23
 
 /* IP_MTU_DISCOVER values */
 #define IP_PMTUDISC_DONT		0	/* Never send DF frames */
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 513d506..a317797 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -37,6 +37,7 @@
 #include <net/route.h>
 #include <net/xfrm.h>
 #include <net/compat.h>
+#include <net/checksum.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/transp_v6.h>
 #endif
@@ -96,6 +97,20 @@ static void ip_cmsg_recv_retopts(struct msghdr *msg, struct sk_buff *skb)
 	put_cmsg(msg, SOL_IP, IP_RETOPTS, opt->optlen, opt->__data);
 }
 
+static void ip_cmsg_recv_checksum(struct msghdr *msg, struct sk_buff *skb,
+				  int offset)
+{
+	__wsum csum = skb->csum;
+
+	if (skb->ip_summed != CHECKSUM_COMPLETE)
+		return;
+
+	if (offset != 0)
+		csum = csum_sub(csum, csum_partial(skb->data, offset, 0));
+
+	put_cmsg(msg, SOL_IP, IP_CHECKSUM, sizeof(__wsum), &csum);
+}
+
 static void ip_cmsg_recv_security(struct msghdr *msg, struct sk_buff *skb)
 {
 	char *secdata;
@@ -191,9 +206,16 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb,
 			return;
 	}
 
-	if (flags & IP_CMSG_ORIGDSTADDR)
+	if (flags & IP_CMSG_ORIGDSTADDR) {
 		ip_cmsg_recv_dstaddr(msg, skb);
 
+		flags &= ~IP_CMSG_ORIGDSTADDR;
+		if (!flags)
+			return;
+	}
+
+	if (flags & IP_CMSG_CHECKSUM)
+		ip_cmsg_recv_checksum(msg, skb, offset);
 }
 EXPORT_SYMBOL(ip_cmsg_recv_offset);
 
@@ -533,6 +555,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 	case IP_MULTICAST_ALL:
 	case IP_MULTICAST_LOOP:
 	case IP_RECVORIGDSTADDR:
+	case IP_CHECKSUM:
 		if (optlen >= sizeof(int)) {
 			if (get_user(val, (int __user *) optval))
 				return -EFAULT;
@@ -630,6 +653,19 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		else
 			inet->cmsg_flags &= ~IP_CMSG_ORIGDSTADDR;
 		break;
+	case IP_CHECKSUM:
+		if (val) {
+			if (!(inet->cmsg_flags & IP_CMSG_CHECKSUM)) {
+				inet_inc_convert_csum(sk);
+				inet->cmsg_flags |= IP_CMSG_CHECKSUM;
+			}
+		} else {
+			if (inet->cmsg_flags & IP_CMSG_CHECKSUM) {
+				inet_dec_convert_csum(sk);
+				inet->cmsg_flags &= ~IP_CMSG_CHECKSUM;
+			}
+		}
+		break;
 	case IP_TOS:	/* This sets both TOS and Precedence */
 		if (sk->sk_type == SOCK_STREAM) {
 			val &= ~INET_ECN_MASK;
@@ -1233,6 +1269,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_RECVORIGDSTADDR:
 		val = (inet->cmsg_flags & IP_CMSG_ORIGDSTADDR) != 0;
 		break;
+	case IP_CHECKSUM:
+		val = (inet->cmsg_flags & IP_CMSG_CHECKSUM) != 0;
+		break;
 	case IP_TOS:
 		val = inet->tos;
 		break;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 53358d8..97ef1f8b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1329,7 +1329,7 @@ try_again:
 		*addr_len = sizeof(*sin);
 	}
 	if (inet->cmsg_flags)
-		ip_cmsg_recv(msg, skb);
+		ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
 
 	err = copied;
 	if (flags & MSG_TRUNC)
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v3 3/4] ip: Add offset parameter to ip_cmsg_recv
From: Tom Herbert @ 2015-01-05 21:56 UTC (permalink / raw)
  To: davem, netdev
In-Reply-To: <1420494977-15026-1-git-send-email-therbert@google.com>

Add ip_cmsg_recv_offset function which takes an offset argument
that indicates the starting offset in skb where data is being received
from. This will be useful in the case of UDP and provided checksum
to user space.

ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
zero.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/net/ip.h       | 7 ++++++-
 net/ipv4/ip_sockglue.c | 5 +++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 0bb6207..0e5a0ba 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -537,7 +537,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
  */
 
 void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
-void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
+void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int offset);
 int ip_cmsg_send(struct net *net, struct msghdr *msg,
 		 struct ipcm_cookie *ipc, bool allow_ipv6);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
@@ -557,6 +557,11 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err, __be16 port,
 void ip_local_error(struct sock *sk, int err, __be32 daddr, __be16 dport,
 		    u32 info);
 
+static inline void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
+{
+	ip_cmsg_recv_offset(msg, skb, 0);
+}
+
 bool icmp_global_allow(void);
 extern int sysctl_icmp_msgs_per_sec;
 extern int sysctl_icmp_msgs_burst;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 80f7856..513d506 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -136,7 +136,8 @@ static void ip_cmsg_recv_dstaddr(struct msghdr *msg, struct sk_buff *skb)
 	put_cmsg(msg, SOL_IP, IP_ORIGDSTADDR, sizeof(sin), &sin);
 }
 
-void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
+void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb,
+			 int offset)
 {
 	struct inet_sock *inet = inet_sk(skb->sk);
 	unsigned int flags = inet->cmsg_flags;
@@ -194,7 +195,7 @@ void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
 		ip_cmsg_recv_dstaddr(msg, skb);
 
 }
-EXPORT_SYMBOL(ip_cmsg_recv);
+EXPORT_SYMBOL(ip_cmsg_recv_offset);
 
 int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
 		 bool allow_ipv6)
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v3 1/4] ip: Move checksum convert defines to inet
From: Tom Herbert @ 2015-01-05 21:56 UTC (permalink / raw)
  To: davem, netdev
In-Reply-To: <1420494977-15026-1-git-send-email-therbert@google.com>

Move convert_csum from udp_sock to inet_sock. This allows the
possibility that we can use convert checksum for different types
of sockets and also allows convert checksum to be enabled from
inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/udp.h     | 16 +---------------
 include/net/inet_sock.h | 17 +++++++++++++++++
 net/ipv4/fou.c          |  2 +-
 net/ipv4/udp.c          |  2 +-
 net/ipv4/udp_tunnel.c   |  2 +-
 net/ipv6/udp.c          |  2 +-
 6 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index ee32775..247cfdc 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,11 +49,7 @@ struct udp_sock {
 	unsigned int	 corkflag;	/* Cork is required */
 	__u8		 encap_type;	/* Is this an Encapsulation socket? */
 	unsigned char	 no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-			 no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
-			 convert_csum:1;/* On receive, convert checksum
-					 * unnecessary to checksum complete
-					 * if possible.
-					 */
+			 no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
 	/*
 	 * Following member retains the information to create a UDP header
 	 * when the socket is uncorked.
@@ -102,16 +98,6 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
 	return udp_sk(sk)->no_check6_rx;
 }
 
-static inline void udp_set_convert_csum(struct sock *sk, bool val)
-{
-	udp_sk(sk)->convert_csum = val;
-}
-
-static inline bool udp_get_convert_csum(struct sock *sk)
-{
-	return udp_sk(sk)->convert_csum;
-}
-
 #define udp_portaddr_for_each_entry(__sk, node, list) \
 	hlist_nulls_for_each_entry(__sk, node, list, __sk_common.skc_portaddr_node)
 
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index a829b77..360b110 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -184,6 +184,7 @@ struct inet_sock {
 				mc_all:1,
 				nodefrag:1;
 	__u8			rcv_tos;
+	__u8			convert_csum;
 	int			uc_index;
 	int			mc_index;
 	__be32			mc_addr;
@@ -250,4 +251,20 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
 	return flags;
 }
 
+static inline void inet_inc_convert_csum(struct sock *sk)
+{
+	inet_sk(sk)->convert_csum++;
+}
+
+static inline void inet_dec_convert_csum(struct sock *sk)
+{
+	if (inet_sk(sk)->convert_csum > 0)
+		inet_sk(sk)->convert_csum--;
+}
+
+static inline bool inet_get_convert_csum(struct sock *sk)
+{
+	return !!inet_sk(sk)->convert_csum;
+}
+
 #endif	/* _INET_SOCK_H */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index b986298..2197c36 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -490,7 +490,7 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 	sk->sk_user_data = fou;
 	fou->sock = sock;
 
-	udp_set_convert_csum(sk, true);
+	inet_inc_convert_csum(sk);
 
 	sk->sk_allocation = GFP_ATOMIC;
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 13b4dcf..53358d8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1806,7 +1806,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (sk != NULL) {
 		int ret;
 
-		if (udp_sk(sk)->convert_csum && uh->check && !IS_UDPLITE(sk))
+		if (inet_get_convert_csum(sk) && uh->check && !IS_UDPLITE(sk))
 			skb_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
 						 inet_compute_pseudo);
 
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 1671263..9996e63 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -63,7 +63,7 @@ void setup_udp_tunnel_sock(struct net *net, struct socket *sock,
 	inet_sk(sk)->mc_loop = 0;
 
 	/* Enable CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE conversion */
-	udp_set_convert_csum(sk, true);
+	inet_inc_convert_csum(sk);
 
 	rcu_assign_sk_user_data(sk, cfg->sk_user_data);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 189dc4a..e41f017 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -909,7 +909,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			goto csum_error;
 		}
 
-		if (udp_sk(sk)->convert_csum && uh->check && !IS_UDPLITE(sk))
+		if (inet_get_convert_csum(sk) && uh->check && !IS_UDPLITE(sk))
 			skb_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
 						 ip6_compute_pseudo);
 
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* [PATCH net-next v3 0/4] ip: Support checksum returned in csmg
From: Tom Herbert @ 2015-01-05 21:56 UTC (permalink / raw)
  To: davem, netdev

This patch set allows the packet checksum for a datagram socket
to be returned in csum data in recvmsg. This allows userspace
to implement its own checksum over the data, for instance if an
IP tunnel was be implemented in user space, the inner checksum
could be validated.

Changes in this patch set:
  - Move checksum conversion to inet_sock from udp_sock. This
    generalizes checksum conversion for use with other protocols.
  - Move IP cmsg constants to a header file and make processing
    of the flags more efficient in ip_cmsg_recv
  - Return checksum value in cmsg. This is specifically the unfolded
    32 bit checksum of the full packet starting from the first byte
    returned in recvmsg

Tested: Wrote a little server to get checksums in cmsg for UDP and
        verfied correct checksum is returned.
        
Tom Herbert (4):
  ip: Move checksum convert defines to inet
  ip: IP cmsg cleanup
  ip: Add offset parameter to ip_cmsg_recv
  ip: Add offset parameter to ip_cmsg_recv

 include/linux/udp.h     |  16 +------
 include/net/inet_sock.h |  29 ++++++++++++-
 include/net/ip.h        |   7 +++-
 include/uapi/linux/in.h |   1 +
 net/ipv4/fou.c          |   2 +-
 net/ipv4/ip_sockglue.c  | 108 +++++++++++++++++++++++++++++++++++-------------
 net/ipv4/udp.c          |   4 +-
 net/ipv4/udp_tunnel.c   |   2 +-
 net/ipv6/udp.c          |   2 +-
 9 files changed, 120 insertions(+), 51 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply

* Re: [PATCH net-next v2] rhashtable: involve rhashtable_lookup_insert routine
From: Thomas Graf @ 2015-01-05 21:52 UTC (permalink / raw)
  To: David Miller; +Cc: ying.xue, netdev
In-Reply-To: <20150105.163018.658637009804208069.davem@davemloft.net>

On 01/05/15 at 04:30pm, David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Mon, 5 Jan 2015 13:05:14 +0000
> 
> > On 01/05/15 at 07:33pm, Ying Xue wrote:
> >> Involve a new function called rhashtable_lookup_insert() which makes
> >> lookup and insertion atomic under bucket lock protection, helping us
> >> avoid to introduce an extra lock when we search and insert an object
> >> into hash table.
> >> 
> >> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> >> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> > 
> > Thanks for putting this around so quickly and thanks for testing.
> > I think this looks good. You might be able to factor out some code
> > from rhashtable_insert() to avoid duplication so we reduce the risk
> > of fixing a bug for one function but not the other.
> 
> Do you want Ying to do this factoring out now in a v3 of this patch
> or in a subsequent patch?
> 
> I assume the former since you didn't give your ACK.

Ying,

I prefer it now if you don't mind. Basically I would like to see the
grow decision factored out at least:

        /* Only grow the table if no resizing is currently in progress. */
        if (ht->tbl != ht->future_tbl &&
            ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
                schedule_delayed_work(&ht->run_work, 0);

^ permalink raw reply

* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: Thomas Graf @ 2015-01-05 21:48 UTC (permalink / raw)
  To: John Fastabend; +Cc: sfeldma, jiri, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <54AADEFF.3090306@gmail.com>

On 01/05/15 at 10:59am, John Fastabend wrote:
> On 01/04/2015 03:12 AM, Thomas Graf wrote:
> >On 12/31/14 at 11:45am, John Fastabend wrote:
> >
> >Impressive work John, some minor nits below. In general this looks
> >great. How large could tables grow? Any risk one of the nested
> >attribtues could exceed 16K in size because of a very large parse
> >graph? Not a problem if we account for it and allow for jumbo
> >attributes.
> >
> 
> hmm it sounds large to me but maybe if you have an NPU that is trying
> to parse into application data it could happen.
> 
> What does it take to allow for jumbo attributes?

We basically need to make user space aware of a new nlattr header
to be expected for certain attributes. We can reserve the 2nd bit
of the type to indicate a 32bit length field following the current
header. We can only do this for new attributes as its not backwards
compatible so we need to think about this before we start exposing
them.

I can send a patch introducing them in the next few days if you
want as it seems you'll have to respin this again anyway.

> >You can jump to hdr_put_failure right away and get rid of the
> >attr_put_failure target as you cancel that nest anyway. You can apply
> >this comment to several other places as well if you want.
> >
> 
> OK so to simplify the error paths we only need to cancel the outer most
> nested attribute. I'll do this transformation.

It's a matter of style. I'm fine either way. Personally I prefer the
single abort error target.

^ permalink raw reply

* Re: [PATCH net-next 0/5] RDMA/cxgb4/cxgb4vf/csiostor: Cleanup register defines
From: David Miller @ 2015-01-05 21:35 UTC (permalink / raw)
  To: hariprasad
  Cc: netdev, linux-scsi, linux-rdma, JBottomley, hch, roland, leedom,
	anish, nirranjan, praveenm, swise
In-Reply-To: <1420455647-23094-1-git-send-email-hariprasad@chelsio.com>

From: Hariprasad Shenai <hariprasad@chelsio.com>
Date: Mon,  5 Jan 2015 16:30:42 +0530

> This series continues to cleanup all the macros/register defines related to
> SGE, PCIE, MC, MA, TCAM, MAC, etc that are defined in t4_regs.h and the
> affected files.
> 
> Will post another 1 or 2 series so that we can cover all the macros so that
> they all follow the same style to be consistent.
> 
> The patches series is created against 'net-next' tree.
> And includes patches on cxgb4, cxgb4vf, iw_cxgb4 and csiostor driver.
> 
> We have included all the maintainers of respective drivers. Kindly review the
> change and let us know in case of any review comments.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] be2net: support TX batching using skb->xmit_more flag
From: David Miller @ 2015-01-05 21:33 UTC (permalink / raw)
  To: sathya.perla; +Cc: netdev
In-Reply-To: <1420454914-9516-1-git-send-email-sathya.perla@emulex.com>

From: Sathya Perla <sathya.perla@emulex.com>
Date: Mon, 5 Jan 2015 05:48:34 -0500

> This patch uses skb->xmit_more flag to batch TX requests.
> TX is flushed either when xmit_more is false or there is
> no more space in the TXQ.
> 
> Skyhawk-R and BEx chips require an even number of wrbs to be posted.
> So, when a batch of TX requests is accumulated, the last header wrb
> may need to be fixed with an extra dummy wrb.
> 
> This patch refactors be_xmit() routine as a sequence of be_xmit_enqueue()
> and be_xmit_flush() calls. The Tx completion code is also
> updated to be able to unmap/free a batch of skbs rather than a single
> skb.
> 
> Signed-off-by: Sathya Perla <sathya.perla@emulex.com>

Applied.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox