Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] udp6: set rx_dst_cookie on rx_dst updates
From: David Miller @ 2017-08-26  3:10 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, subashab, hannes
In-Reply-To: <9e52c29f4bd47d591cdc7bf6c3a88b2fc57e4422.1503664112.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 25 Aug 2017 14:31:01 +0200

> Currently, in the udp6 code, the dst cookie is not initialized/updated
> concurrently with the RX dst used by early demux.
> 
> As a result, the dst_check() in the early_demux path always fails,
> the rx dst cache is always invalidated, and we can't really
> leverage significant gain from the demux lookup.
> 
> Fix it adding udp6 specific variant of sk_rx_dst_set() and use it
> to set the dst cookie when the dst entry is really changed.
> 
> The issue is there since the introduction of early demux for ipv6.
> 
> Fixes: 5425077d73e0 ("net: ipv6: Add early demux handler for UDP unicast")
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next v5] net: stmmac: Delete dead code for MDIO registration
From: David Miller @ 2017-08-26  3:08 UTC (permalink / raw)
  To: romain.perier
  Cc: peppe.cavallaro, alexandre.torgue, andrew, f.fainelli, netdev,
	linux-kernel
In-Reply-To: <20170825064959.9603-1-romain.perier@collabora.com>

From: Romain Perier <romain.perier@collabora.com>
Date: Fri, 25 Aug 2017 08:49:59 +0200

> This code is no longer used, the logging function was changed by commit
> fbca164776e4 ("net: stmmac: Use the right logging function in stmmac_mdio_register").
> It was previously showing information about the type of the IRQ, if it's
> polled, ignored or a normal interrupt. As we don't want information loss,
> I have moved this code to phy_attached_print().
> 
> Fixes: fbca164776e4 ("net: stmmac: Use the right logging function in stmmac_mdio_register")
> Signed-off-by: Romain Perier <romain.perier@collabora.com>

This doesn't apply to net-next.

^ permalink raw reply

* Re: [PATCH] net: sxgbe: check memory allocation failure
From: David Miller @ 2017-08-26  3:07 UTC (permalink / raw)
  To: christophe.jaillet
  Cc: bh74.an, ks.giri, vipul.pandya, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <20170825053551.31672-1-christophe.jaillet@wanadoo.fr>

From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Fri, 25 Aug 2017 07:35:51 +0200

> Check memory allocation failure and return -ENOMEM in such a case, as
> already done few lines below for another memory allocation.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next 3/3 v9] drivers: net: ethernet: qualcomm: rmnet: Initial implementation
From: David Miller @ 2017-08-26  3:06 UTC (permalink / raw)
  To: subashab
  Cc: netdev, fengguang.wu, dcbw, jiri, stephen, David.Laight, marcel,
	andrew
In-Reply-To: <1503635966-14076-4-git-send-email-subashab@codeaurora.org>

From: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Date: Thu, 24 Aug 2017 22:39:26 -0600

> +static void rmnet_force_unassociate_device(struct net_device *dev)
> +{
> +	struct net_device *real_dev = dev;
> +	struct rmnet_walk_data d;
> +	LIST_HEAD(list);
> +
> +	if (!rmnet_is_real_dev_registered(real_dev))
> +		return;
> +
> +	ASSERT_RTNL();
> +
> +	d.real_dev = real_dev;
> +	d.head = &list;
> +
> +	rcu_read_lock();
> +	netdev_walk_all_lower_dev_rcu(real_dev, rmnet_dev_walk_unreg, &d);
> +	synchronize_net();
> +
> +	unregister_netdevice_many(&list);
> +	rcu_read_unlock();
> +
> +	rmnet_unregister_real_device(real_dev);
> +}

In these code paths where you are the writer, you have to rely upon
the RTNL mutex (or some other mutual exclusion mechanism) to protect
the update operation.  RCU locking itself does not provide this.

So you should use something like rcu_dereference_rtnl() or similar.

So this would be rmnet_force_unassociate_device() and rmnet_dellink()

RCU is subtle and the writer paths have the be handled differently
from the reader paths.  Please take some time to study how RCU should
be applied properly in these situations rather than just slapping a
patch together overnight.

Thank you.

^ permalink raw reply

* Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters
From: Alexei Starovoitov @ 2017-08-26  2:49 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, tj, davem
In-Reply-To: <1503687941-626-2-git-send-email-dsahern@gmail.com>

On Fri, Aug 25, 2017 at 12:05:34PM -0700, David Ahern wrote:
> Add support for recursively applying sock filters attached to a cgroup.
> For now, start with the inner cgroup attached to the socket and work back
> to the root or first cgroup without the recursive flag set. Once the
> recursive flag is set for a cgroup all descendant group's must have the
> flag as well.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
>  include/linux/bpf-cgroup.h | 10 ++++++----
>  include/uapi/linux/bpf.h   |  9 +++++++++
>  kernel/bpf/cgroup.c        | 29 ++++++++++++++++++++++-------
>  kernel/bpf/syscall.c       |  6 +++---
>  kernel/cgroup/cgroup.c     | 25 +++++++++++++++++++++++--
>  5 files changed, 63 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index d41d40ac3efd..2d02187f242f 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -23,6 +23,7 @@ struct cgroup_bpf {
>  	struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>  	struct bpf_prog __rcu *effective[MAX_BPF_ATTACH_TYPE];
>  	bool disallow_override[MAX_BPF_ATTACH_TYPE];
> +	bool is_recursive[MAX_BPF_ATTACH_TYPE];
>  };
>  
>  void cgroup_bpf_put(struct cgroup *cgrp);
> @@ -30,18 +31,19 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
>  
>  int __cgroup_bpf_update(struct cgroup *cgrp, struct cgroup *parent,
>  			struct bpf_prog *prog, enum bpf_attach_type type,
> -			bool overridable);
> +			u32 flags);
>  
>  /* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
>  int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog,
> -		      enum bpf_attach_type type, bool overridable);
> +		      enum bpf_attach_type type, u32 flags);
>  
>  int __cgroup_bpf_run_filter_skb(struct sock *sk,
>  				struct sk_buff *skb,
>  				enum bpf_attach_type type);
>  
> -int __cgroup_bpf_run_filter_sk(struct sock *sk,
> +int __cgroup_bpf_run_filter_sk(struct cgroup *cgrp, struct sock *sk,
>  			       enum bpf_attach_type type);
> +int cgroup_bpf_run_filter_sk(struct sock *sk, enum bpf_attach_type type);
>  
>  int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
>  				     struct bpf_sock_ops_kern *sock_ops,
> @@ -74,7 +76,7 @@ int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
>  ({									       \
>  	int __ret = 0;							       \
>  	if (cgroup_bpf_enabled && sk) {					       \
> -		__ret = __cgroup_bpf_run_filter_sk(sk,			       \
> +		__ret = cgroup_bpf_run_filter_sk(sk,			       \
>  						 BPF_CGROUP_INET_SOCK_CREATE); \
>  	}								       \
>  	__ret;								       \
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f71f5e07d82d..595e31b30f23 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -151,6 +151,15 @@ enum bpf_attach_type {
>   */
>  #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
>  
> +/* If BPF_F_RECURSIVE flag is used in BPF_PROG_ATTACH command
> + * cgroups are walked recursively back to the root cgroup or the
> + * first cgroup without the flag set running any program attached.
> + * Once the flag is set, it MUST be set for all descendant cgroups.
> + */
> +#define BPF_F_RECURSIVE		(1U << 1)

above logic makes sense, but ...

> +	if (prog && curr_recursive && !new_recursive)
> +		/* if a parent has recursive prog attached, only
> +		 * allow recursive programs in descendent cgroup
> +		 */
> +		return -EINVAL;
> +
>  	old_prog = cgrp->bpf.prog[type];

... I'm struggling to completely understand how it interacts
with BPF_F_ALLOW_OVERRIDE.
By default we shouldn't allow overriding, so if default prog attached
to a root, what happens if we try to attach F_RECURSIVE to a descendent?
If I'm reading the code correctly it will not succeed, which is good.
Could you add such scenario as test to test_cgrp2_attach2.c ?

Now say we attach overridable and !recursive to a root, another
recursive prog will not be attached to a descedent, which is correct.

But if we attach !overridable + recursive to a root we cannot attach
anything to a descendent right? Then why allow such combination at all?
So only overridable + recursive combination makes sense, right?

I think all these combinations must be documented and tests must be
added. Sooner or later people will build security sensitive environment
with it and we have to meticulous now.

Do you think it would make sense to split this patch out and
push patches 2 and 3 with few tests in parallel, while we're review
this change?

Tejun needs to take a deep look into this patch as well.

^ permalink raw reply

* Re: [net-next v2 00/13][pull request] 40GbE Intel Wired LAN Driver Updates 2017-08-25
From: David Miller @ 2017-08-26  2:41 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20170825220057.51804-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Fri, 25 Aug 2017 15:00:44 -0700

> This series contains updates to i40e and i40evf only.

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH v2 net-next 3/8] bpf: Allow cgroup sock filters to use get_current_uid_gid helper
From: Alexei Starovoitov @ 2017-08-26  2:30 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, tj, davem
In-Reply-To: <1503687941-626-4-git-send-email-dsahern@gmail.com>

On Fri, Aug 25, 2017 at 12:05:36PM -0700, David Ahern wrote:
> Allow BPF programs run on sock create to use the get_current_uid_gid
> helper. IPv4 and IPv6 sockets are created in a process context so
> there is always a valid uid/gid
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next 0/2] nfp: SR-IOV ndos support
From: David Miller @ 2017-08-26  2:25 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20170825043150.375-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Thu, 24 Aug 2017 21:31:48 -0700

> This set adds basic SR-IOV including setting/getting VF MAC addresses,
> VLANs, link state and spoofcheck settings.  It is wired up for both
> vNICs and representors (note: ip link will not report VF settings on
> VF/PF representors because they are not linked to the PF PCI device).
> 
> Pablo and team add the basic implementation, Simon and Dirk follow
> up with the representor plumbing.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net 0/2] r8169: Be drop monitor friendly
From: David Miller @ 2017-08-26  2:13 UTC (permalink / raw)
  To: f.fainelli
  Cc: netdev, nic_swsd, romieu, edumazet, alexander.h.duyck, sgruszka
In-Reply-To: <20170825013359.27258-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu, 24 Aug 2017 18:33:57 -0700

> First patch may be questionable but no other driver appears to be doing that
> and while it is defendable to account for left packets as dropped during TX
> clean, this appears misleadning. I picked Stanislaw changes which brings us
> back to 2010, but this was present from pre-git days as well.

Right, drivers should not do this.

> Second patch fixes the two missing calls to dev_consume_skb_any().

Series applied, thanks.

^ permalink raw reply

* [PATCH net-next v2 2/2] tcp_diag: report TCP MD5 signing keys and addresses
From: Ivan Delalande @ 2017-08-26  1:53 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Ivan Delalande
In-Reply-To: <20170826015346.24247-1-colona@arista.com>

Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
not possible to retrieve these from the kernel once they have been
configured on sockets.

Signed-off-by: Ivan Delalande <colona@arista.com>
---
 include/uapi/linux/inet_diag.h |   1 +
 net/ipv4/tcp_diag.c            | 112 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 678496897a68..f52ff62bfabe 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -143,6 +143,7 @@ enum {
 	INET_DIAG_MARK,
 	INET_DIAG_BBRINFO,
 	INET_DIAG_CLASS_ID,
+	INET_DIAG_MD5SIG,
 	__INET_DIAG_MAX,
 };
 
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index a748c74aa8b7..99c54b765921 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -16,6 +16,7 @@
 
 #include <linux/tcp.h>
 
+#include <net/netlink.h>
 #include <net/tcp.h>
 
 static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
@@ -36,6 +37,103 @@ static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
 		tcp_get_info(sk, info);
 }
 
+#ifdef CONFIG_TCP_MD5SIG
+static void inet_diag_md5sig_fill(struct tcp_md5sig *info,
+				  const struct tcp_md5sig_key *key)
+{
+	#if IS_ENABLED(CONFIG_IPV6)
+	if (key->family == AF_INET6) {
+		struct sockaddr_in6 *sin6 =
+			(struct sockaddr_in6 *)&info->tcpm_addr;
+
+		memcpy(&sin6->sin6_addr, &key->addr.a6,
+		       sizeof(struct in6_addr));
+	} else
+	#endif
+	{
+		struct sockaddr_in *sin =
+			(struct sockaddr_in *)&info->tcpm_addr;
+
+		memcpy(&sin->sin_addr, &key->addr.a4, sizeof(struct in_addr));
+	}
+
+	info->tcpm_addr.ss_family = key->family;
+	info->tcpm_prefixlen = key->prefixlen;
+	info->tcpm_keylen = key->keylen;
+	memcpy(info->tcpm_key, key->key, key->keylen);
+}
+
+static int inet_diag_put_md5sig(struct sk_buff *skb,
+				const struct tcp_md5sig_info *md5sig)
+{
+	const struct tcp_md5sig_key *key;
+	struct nlattr *attr;
+	struct tcp_md5sig *info;
+	int md5sig_count = 0;
+
+	hlist_for_each_entry_rcu(key, &md5sig->head, node)
+		md5sig_count++;
+
+	attr = nla_reserve(skb, INET_DIAG_MD5SIG,
+			   md5sig_count * sizeof(struct tcp_md5sig));
+	if (!attr)
+		return -EMSGSIZE;
+
+	info = nla_data(attr);
+	hlist_for_each_entry_rcu(key, &md5sig->head, node) {
+		inet_diag_md5sig_fill(info, key);
+		info++;
+	}
+
+	return 0;
+}
+#endif
+
+static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
+			    struct sk_buff *skb)
+{
+#ifdef CONFIG_TCP_MD5SIG
+	if (net_admin) {
+		struct tcp_md5sig_info *md5sig;
+		int err = 0;
+
+		lock_sock(sk);
+		rcu_read_lock();
+		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+		if (md5sig)
+			err = inet_diag_put_md5sig(skb, md5sig);
+		rcu_read_unlock();
+		release_sock(sk);
+		if (err < 0)
+			return err;
+	}
+#endif
+
+	return 0;
+}
+
+static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
+{
+	size_t size = 0;
+
+#ifdef CONFIG_TCP_MD5SIG
+	if (sk_fullsock(sk)) {
+		const struct tcp_md5sig_info *md5sig;
+		const struct tcp_md5sig_key *key;
+
+		rcu_read_lock();
+		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+		if (md5sig) {
+			hlist_for_each_entry_rcu(key, &md5sig->head, node)
+				size += sizeof(struct tcp_md5sig);
+		}
+		rcu_read_unlock();
+	}
+#endif
+
+	return size;
+}
+
 static void tcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r, struct nlattr *bc)
 {
@@ -68,13 +166,15 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 #endif
 
 static const struct inet_diag_handler tcp_diag_handler = {
-	.dump		 = tcp_diag_dump,
-	.dump_one	 = tcp_diag_dump_one,
-	.idiag_get_info	 = tcp_diag_get_info,
-	.idiag_type	 = IPPROTO_TCP,
-	.idiag_info_size = sizeof(struct tcp_info),
+	.dump			= tcp_diag_dump,
+	.dump_one		= tcp_diag_dump_one,
+	.idiag_get_info		= tcp_diag_get_info,
+	.idiag_get_aux		= tcp_diag_get_aux,
+	.idiag_get_aux_size	= tcp_diag_get_aux_size,
+	.idiag_type		= IPPROTO_TCP,
+	.idiag_info_size	= sizeof(struct tcp_info),
 #ifdef CONFIG_INET_DIAG_DESTROY
-	.destroy	 = tcp_diag_destroy,
+	.destroy		= tcp_diag_destroy,
 #endif
 };
 
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 1/2] inet_diag: allow protocols to provide additional data
From: Ivan Delalande @ 2017-08-26  1:53 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Ivan Delalande

Extend inet_diag_handler to allow individual protocols to report
additional data on INET_DIAG_INFO through idiag_get_aux. The size
can be dynamic and is computed by idiag_get_aux_size.

Signed-off-by: Ivan Delalande <colona@arista.com>
---
 include/linux/inet_diag.h |  7 +++++++
 net/ipv4/inet_diag.c      | 22 ++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index 65da430e260f..ee251c585854 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -25,6 +25,13 @@ struct inet_diag_handler {
 					  struct inet_diag_msg *r,
 					  void *info);
 
+	int		(*idiag_get_aux)(struct sock *sk,
+					 bool net_admin,
+					 struct sk_buff *skb);
+
+	size_t		(*idiag_get_aux_size)(struct sock *sk,
+					      bool net_admin);
+
 	int		(*destroy)(struct sk_buff *in_skb,
 				   const struct inet_diag_req_v2 *req);
 
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 67325d5832d7..8a88ef373395 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -93,8 +93,17 @@ void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(inet_diag_msg_common_fill);
 
-static size_t inet_sk_attr_size(void)
+static size_t inet_sk_attr_size(struct sock *sk,
+				const struct inet_diag_req_v2 *req,
+				bool net_admin)
 {
+	const struct inet_diag_handler *handler;
+	size_t aux = 0;
+
+	handler = inet_diag_table[req->sdiag_protocol];
+	if (handler && handler->idiag_get_aux_size)
+		aux = handler->idiag_get_aux_size(sk, net_admin);
+
 	return	  nla_total_size(sizeof(struct tcp_info))
 		+ nla_total_size(1) /* INET_DIAG_SHUTDOWN */
 		+ nla_total_size(1) /* INET_DIAG_TOS */
@@ -105,6 +114,7 @@ static size_t inet_sk_attr_size(void)
 		+ nla_total_size(SK_MEMINFO_VARS * sizeof(u32))
 		+ nla_total_size(TCP_CA_NAME_MAX)
 		+ nla_total_size(sizeof(struct tcpvegas_info))
+		+ nla_total_size(aux)
 		+ 64;
 }
 
@@ -260,6 +270,10 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 
 	handler->idiag_get_info(sk, r, info);
 
+	if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux)
+		if (handler->idiag_get_aux(sk, net_admin, skb) < 0)
+			goto errout;
+
 	if (sk->sk_state < TCP_TIME_WAIT) {
 		union tcp_cc_info info;
 		size_t sz = 0;
@@ -452,13 +466,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
 	struct net *net = sock_net(in_skb->sk);
 	struct sk_buff *rep;
 	struct sock *sk;
+	bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN);
 	int err;
 
 	sk = inet_diag_find_one_icsk(net, hashinfo, req);
 	if (IS_ERR(sk))
 		return PTR_ERR(sk);
 
-	rep = nlmsg_new(inet_sk_attr_size(), GFP_KERNEL);
+	rep = nlmsg_new(inet_sk_attr_size(sk, req, net_admin), GFP_KERNEL);
 	if (!rep) {
 		err = -ENOMEM;
 		goto out;
@@ -467,8 +482,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
 	err = sk_diag_fill(sk, rep, req,
 			   sk_user_ns(NETLINK_CB(in_skb).sk),
 			   NETLINK_CB(in_skb).portid,
-			   nlh->nlmsg_seq, 0, nlh,
-			   netlink_net_capable(in_skb, CAP_NET_ADMIN));
+			   nlh->nlmsg_seq, 0, nlh, net_admin);
 	if (err < 0) {
 		WARN_ON(err == -EMSGSIZE);
 		nlmsg_free(rep);
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters
From: Daniel Borkmann @ 2017-08-26  2:00 UTC (permalink / raw)
  To: David Ahern, netdev, ast, tj, davem
In-Reply-To: <1503687941-626-2-git-send-email-dsahern@gmail.com>

On 08/25/2017 09:05 PM, David Ahern wrote:
> Add support for recursively applying sock filters attached to a cgroup.
> For now, start with the inner cgroup attached to the socket and work back
> to the root or first cgroup without the recursive flag set. Once the
> recursive flag is set for a cgroup all descendant group's must have the
> flag as well.
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f71f5e07d82d..595e31b30f23 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -151,6 +151,15 @@ enum bpf_attach_type {
>    */
>   #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
>
> +/* If BPF_F_RECURSIVE flag is used in BPF_PROG_ATTACH command
> + * cgroups are walked recursively back to the root cgroup or the
> + * first cgroup without the flag set running any program attached.
> + * Once the flag is set, it MUST be set for all descendant cgroups.
> + */
> +#define BPF_F_RECURSIVE		(1U << 1)
> +
> +#define BPF_F_ALL_ATTACH_FLAGS  (BPF_F_ALLOW_OVERRIDE | BPF_F_RECURSIVE)
> +
>   /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>    * verifier will perform strict alignment checking as if the kernel
>    * has been built with CONFIG_EFFICIENT_UNALIGNED_ACCESS not set,
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 546113430049..eb1f436c18fb 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -47,10 +47,16 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
>   	unsigned int type;
>
>   	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
> -		struct bpf_prog *e;
> +		struct bpf_prog *e = NULL;
> +
> +		/* do not need to set effective program if cgroups are
> +		 * walked recursively
> +		 */
> +		cgrp->bpf.is_recursive[type] = parent->bpf.is_recursive[type];
> +		if (!cgrp->bpf.is_recursive[type])
> +			e = rcu_dereference_protected(parent->bpf.effective[type],
> +						      lockdep_is_held(&cgroup_mutex));

[...]

> -		e = rcu_dereference_protected(parent->bpf.effective[type],
> -					      lockdep_is_held(&cgroup_mutex));
>   		rcu_assign_pointer(cgrp->bpf.effective[type], e);
>   		cgrp->bpf.disallow_override[type] = parent->bpf.disallow_override[type];
>   	}
[...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d5774a6851f1..a1ab5dbaae89 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1187,7 +1187,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   	if (CHECK_ATTR(BPF_PROG_ATTACH))
>   		return -EINVAL;
>
> -	if (attr->attach_flags & ~BPF_F_ALLOW_OVERRIDE)
> +	if (attr->attach_flags & ~BPF_F_ALL_ATTACH_FLAGS)
>   		return -EINVAL;
>
>   	switch (attr->attach_type) {
> @@ -1222,7 +1222,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   	}
>
>   	ret = cgroup_bpf_update(cgrp, prog, attr->attach_type,
> -				attr->attach_flags & BPF_F_ALLOW_OVERRIDE);
> +				attr->attach_flags);
>   	if (ret)
>   		bpf_prog_put(prog);
>   	cgroup_put(cgrp);
> @@ -1252,7 +1252,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   		if (IS_ERR(cgrp))
>   			return PTR_ERR(cgrp);
>
> -		ret = cgroup_bpf_update(cgrp, NULL, attr->attach_type, false);
> +		ret = cgroup_bpf_update(cgrp, NULL, attr->attach_type, 0);
>   		cgroup_put(cgrp);
>   		break;

Can you elaborate on the semantical changes for the programs
setting the new flag which are not using below cgroup_bpf_run_filter_sk()
helper to walk back to root?

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index df2e0f14a95d..27a4f14435a3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5176,14 +5176,35 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd)
>
>   #ifdef CONFIG_CGROUP_BPF
>   int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog,
> -		      enum bpf_attach_type type, bool overridable)
> +		      enum bpf_attach_type type, u32 flags)
>   {
>   	struct cgroup *parent = cgroup_parent(cgrp);
>   	int ret;
>
>   	mutex_lock(&cgroup_mutex);
> -	ret = __cgroup_bpf_update(cgrp, parent, prog, type, overridable);
> +	ret = __cgroup_bpf_update(cgrp, parent, prog, type, flags);
>   	mutex_unlock(&cgroup_mutex);
>   	return ret;
>   }
> +
> +int cgroup_bpf_run_filter_sk(struct sock *sk,
> +			     enum bpf_attach_type type)
> +{
> +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	int ret = 0;
> +
> +	while (cgrp) {
> +		ret = __cgroup_bpf_run_filter_sk(cgrp, sk, type);
> +		if (ret)
> +			break;
> +
> +		if (!cgrp->bpf.is_recursive[type])
> +			break;
> +
> +		cgrp = cgroup_parent(cgrp);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(cgroup_bpf_run_filter_sk);
>   #endif /* CONFIG_CGROUP_BPF */
>

^ permalink raw reply

* Re: UDP sockets oddities
From: Eric Dumazet @ 2017-08-26  1:52 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <ce4af8cc-3de1-a777-967c-a57103994e1d@gmail.com>

On Fri, 2017-08-25 at 18:17 -0700, Florian Fainelli wrote:
> On 08/25/2017 04:57 PM, Eric Dumazet wrote:
> > On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
> > 
> >> Eric, are there areas of the stack where we are allowed to drop packets,
> >> not propagate that back to write(2) and also not increment any counter
> >> either, or maybe I am not looking where I should...
> > 
> > What happens if you increase these sysctls ?
> 
> I don't see packet loss after I tweak these two sysctls according to
> your suggestions.
> 
> Tweaking eth0's sysctls did not change anything, but tweaking gphy's
> sysctl resolved the loss. This was a little surprising considering that
> gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
> does the real transmission.
> 
> Does that make sense with respect to what I reported earlier? Should I
> try to dump the neigh stats?

Note that if you had TCP traffic, the neighbour would be constantly
confirmed and no losses would happen.

I guess we should an SNMP counter for packets dropped in neigh queues.

^ permalink raw reply

* Re: UDP sockets oddities
From: Florian Fainelli @ 2017-08-26  1:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <1503705440.11498.9.camel@edumazet-glaptop3.roam.corp.google.com>

On 08/25/2017 04:57 PM, Eric Dumazet wrote:
> On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
> 
>> Eric, are there areas of the stack where we are allowed to drop packets,
>> not propagate that back to write(2) and also not increment any counter
>> either, or maybe I am not looking where I should...
> 
> What happens if you increase these sysctls ?

I don't see packet loss after I tweak these two sysctls according to
your suggestions.

Tweaking eth0's sysctls did not change anything, but tweaking gphy's
sysctl resolved the loss. This was a little surprising considering that
gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
does the real transmission.

Does that make sense with respect to what I reported earlier? Should I
try to dump the neigh stats?

Thanks!

> 
> grep .  `find /proc/sys|grep unres_qlen`
> 
> 
> unres_qlen_bytes -> 2000000
> unres_qlen -> 10000
> 
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v7 05/10] landlock: Add LSM hooks related to filesystem
From: Alexei Starovoitov @ 2017-08-26  1:16 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf <tgr
In-Reply-To: <22d09137-7212-5803-af64-0964fad875c7@digikod.net>

On Fri, Aug 25, 2017 at 10:16:39AM +0200, Mickaël Salaün wrote:
> > 
> >> +/* WRAP_ARG_SB */
> >> +#define WRAP_ARG_SB_TYPE	WRAP_TYPE_FS
> >> +#define WRAP_ARG_SB_DEC(arg)					\
> >> +	EXPAND_C(WRAP_TYPE_FS) wrap_##arg =			\
> >> +	{ .type = BPF_HANDLE_FS_TYPE_DENTRY, .dentry = arg->s_root };
> >> +#define WRAP_ARG_SB_VAL(arg)	((uintptr_t)&wrap_##arg)
> >> +#define WRAP_ARG_SB_OK(arg)	(arg && arg->s_root)
> > ...
> > 
> >> +HOOK_NEW_FS(sb_remount, 2,
> >> +	struct super_block *, sb,
> >> +	void *, data,
> >> +	WRAP_ARG_SB, sb,
> >> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> >> +);
> > 
> > this looks wrong. casting super_block to dentry?
> 
> This is called when remounting a block device. The WRAP_ARG_SB take the
> sb->s_root as a dentry, it is not a cast. What do you expect from this hook?

got it. I missed -> part. Now it makes sense.

> > 
> >> +/* a directory inode contains only one dentry */
> >> +HOOK_NEW_FS(inode_create, 3,
> >> +	struct inode *, dir,
> >> +	struct dentry *, dentry,
> >> +	umode_t, mode,
> >> +	WRAP_ARG_INODE, dir,
> >> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> >> +);
> > 
> > more general question: why you're not wrapping all useful
> > arguments? Like in the above dentry can be acted upon
> > by the landlock rule and it's readily available...
> 
> The context used for the FS event must have the exact same types for all
> calls. This event is meant to be generic but we can add more specific
> ones if needed, like I do with FS_IOCTL.

I see. So all FS events will have dentry as first argument
regardless of how it is in LSM hook ?
I guess that will simplify the rules indeed.
I suspect you're doing it to simplify the LSM->landlock shim layer as well, right?

> The idea is to enable people to write simple rules, while being able to
> write fine grain rules for special cases (e.g. IOCTL) if needed.
> 
> > 
> > The limitation of only 2 args looks odd.
> > Is it a hard limitation ? how hard to extend?
> 
> It's not a hard limit at all. Actually, the FS_FNCTL event should have
> three arguments (I'll add them in the next series): FS handle, FCNTL
> command and FCNTL argument. I made sure that it's really easy to add
> more arguments to the context of an event.

The reason I'm asking, because I'm not completely convinced that
adding another argument to existing event will be backwards compatible.
It looks like you're expecting only two args for all FS events, right?
How can you add 3rd argument? All FS events would have to get it,
but in some LSM hooks such argument will be meaningless, whereas
in other places it will carry useful info that rule can operate on.
Would that mean that we'll have FS_3 event type and only few LSM
hooks will be converted to it. That works, but then we'll lose
compatiblity with old rules written for FS event and that given hook.
Otherwise we'd need to have fancy logic to accept old FS event
into FS_3 LSM hook.



^ permalink raw reply

* Re: [PATCH net-next] selftests/bpf: check the instruction dumps are populated
From: Daniel Borkmann @ 2017-08-26  1:16 UTC (permalink / raw)
  To: Jakub Kicinski, netdev; +Cc: kafai, oss-drivers
In-Reply-To: <20170825213957.4768-1-jakub.kicinski@netronome.com>

On 08/25/2017 11:39 PM, Jakub Kicinski wrote:
> Add a basic test for checking whether kernel is populating
> the jited and xlated BPF images.  It was used to confirm
> the behaviour change from commit d777b2ddbecf ("bpf: don't
> zero out the info struct in bpf_obj_get_info_by_fd()"),
> which made bpf_obj_get_info_by_fd() usable for retrieving
> the image dumps.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
[...]
> @@ -328,15 +331,20 @@ static void test_bpf_obj_id(void)
>   			  prog_infos[i].type != BPF_PROG_TYPE_SOCKET_FILTER ||
>   			  info_len != sizeof(struct bpf_prog_info) ||
>   			  (jit_enabled && !prog_infos[i].jited_prog_len) ||
> -			  !prog_infos[i].xlated_prog_len,
> +			  (jit_enabled &&
> +			   !memcmp(jited_insns, zeros, sizeof(zeros))) ||
> +			  !prog_infos[i].xlated_prog_len ||
> +			  !memcmp(xlated_insns, zeros, sizeof(zeros)),

There could still be the case where a JIT could bail out
for some reason and punt to the interpreter instead, but
I'm fine assuming for the specific test cases we have it
has to succeed, and if not JIT misses features or has other
issues. ;) Thus:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks!

^ permalink raw reply

* [PATCH net-next v3] e1000e: Be drop monitor friendly
From: Florian Fainelli @ 2017-08-26  1:14 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, davem, Florian Fainelli, Jeff Kirsher,
	moderated list:INTEL ETHERNET DRIVERS, open list

e1000e_put_txbuf() can be called from normal reclamation path as well as
when a DMA mapping failure, so we need to differentiate these two cases
when freeing SKBs to be drop monitor friendly. e1000e_tx_hwtstamp_work()
and e1000_remove() are processing TX timestamped SKBs and those should
not be accounted as drops either.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v3:

- differentiate normal reclamation from TX DMA fragment mapping errors
- removed a few invalid dev_kfree_skb() replacements (those are already
  drop monitor friendly)

Changes in v2:

- make it compile

 drivers/net/ethernet/intel/e1000e/netdev.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 327dfe5bedc0..cfd21858c095 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1071,7 +1071,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
 }
 
 static void e1000_put_txbuf(struct e1000_ring *tx_ring,
-			    struct e1000_buffer *buffer_info)
+			    struct e1000_buffer *buffer_info,
+			    bool drop)
 {
 	struct e1000_adapter *adapter = tx_ring->adapter;
 
@@ -1085,7 +1086,10 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring,
 		buffer_info->dma = 0;
 	}
 	if (buffer_info->skb) {
-		dev_kfree_skb_any(buffer_info->skb);
+		if (drop)
+			dev_kfree_skb_any(buffer_info->skb);
+		else
+			dev_consume_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
 	buffer_info->time_stamp = 0;
@@ -1199,7 +1203,7 @@ static void e1000e_tx_hwtstamp_work(struct work_struct *work)
 		wmb(); /* force write prior to skb_tstamp_tx */
 
 		skb_tstamp_tx(skb, &shhwtstamps);
-		dev_kfree_skb_any(skb);
+		dev_consume_skb_any(skb);
 	} else if (time_after(jiffies, adapter->tx_hwtstamp_start
 			      + adapter->tx_timeout_factor * HZ)) {
 		dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
@@ -1254,7 +1258,7 @@ static bool e1000_clean_tx_irq(struct e1000_ring *tx_ring)
 				}
 			}
 
-			e1000_put_txbuf(tx_ring, buffer_info);
+			e1000_put_txbuf(tx_ring, buffer_info, false);
 			tx_desc->upper.data = 0;
 
 			i++;
@@ -2421,7 +2425,7 @@ static void e1000_clean_tx_ring(struct e1000_ring *tx_ring)
 
 	for (i = 0; i < tx_ring->count; i++) {
 		buffer_info = &tx_ring->buffer_info[i];
-		e1000_put_txbuf(tx_ring, buffer_info);
+		e1000_put_txbuf(tx_ring, buffer_info, false);
 	}
 
 	netdev_reset_queue(adapter->netdev);
@@ -5614,7 +5618,7 @@ static int e1000_tx_map(struct e1000_ring *tx_ring, struct sk_buff *skb,
 			i += tx_ring->count;
 		i--;
 		buffer_info = &tx_ring->buffer_info[i];
-		e1000_put_txbuf(tx_ring, buffer_info);
+		e1000_put_txbuf(tx_ring, buffer_info, true);
 	}
 
 	return 0;
@@ -7411,7 +7415,7 @@ static void e1000_remove(struct pci_dev *pdev)
 	if (adapter->flags & FLAG_HAS_HW_TIMESTAMP) {
 		cancel_work_sync(&adapter->tx_hwtstamp_work);
 		if (adapter->tx_hwtstamp_skb) {
-			dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
+			dev_consume_skb_any(adapter->tx_hwtstamp_skb);
 			adapter->tx_hwtstamp_skb = NULL;
 		}
 	}
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH net-next v7 01/10] selftest: Enhance kselftest_harness.h with a step mechanism
From: Alexei Starovoitov @ 2017-08-26  1:07 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Shuah Khan, linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Tejun Heo, Thomas Graf <tgr
In-Reply-To: <0e15da13-fad0-ba01-053c-1b4853e2bd6f@digikod.net>

On Fri, Aug 25, 2017 at 09:58:33AM +0200, Mickaël Salaün wrote:
> 
> 
> On 24/08/2017 04:31, Alexei Starovoitov wrote:
> > On Mon, Aug 21, 2017 at 02:09:24AM +0200, Mickaël Salaün wrote:
> >> This step mechanism may be useful to return an information about the
> >> error without being able to write to TH_LOG_STREAM.
> >>
> >> Set _metadata->no_print to true to print this counter.
> >>
> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> >> Cc: Andy Lutomirski <luto@amacapital.net>
> >> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> >> Cc: Kees Cook <keescook@chromium.org>
> >> Cc: Shuah Khan <shuah@kernel.org>
> >> Cc: Will Drewry <wad@chromium.org>
> >> Link: https://lkml.kernel.org/r/CAGXu5j+D-FP8Kt9unNOqKrQJP4DYTpmgkJxWykZyrYiVPz3Y3Q@mail.gmail.com
> >> ---
> >>
> >> This patch is intended to the kselftest tree:
> >> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
> >>
> >> Changes since v6:
> >> * add the step counter in assert/expect macros and use _metadata to
> >>   enable the counter (suggested by Kees Cook)
> >> ---
> >>  tools/testing/selftests/kselftest_harness.h   | 31 ++++++++++++++++++++++-----
> >>  tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
> >>  2 files changed, 27 insertions(+), 6 deletions(-)
> > 
> > is there a dependency on this in patches 2+ ?
> > if not, I would send this patch to selftests right away.
> > 
> > 
> 
> The Landlock tests [patch 9/10] rely on it for now.
> 
> I sent it three weeks ago:
> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
> 
> Anyway, until this patch is merged in the kselftest tree and then
> available to net-next, I'll have to keep it here.

Shuah,
could you please pick up this patch into your tree?

^ permalink raw reply

* Re: [PATCH net-next] bpf: fix oops on allocation failure
From: Alexei Starovoitov @ 2017-08-26  1:05 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Alexei Starovoitov, John Fastabend, Daniel Borkmann, netdev,
	kernel-janitors
In-Reply-To: <20170825202714.64ivixeindjph3z6@mwanda>

On Fri, Aug 25, 2017 at 11:27:14PM +0300, Dan Carpenter wrote:
> "err" is set to zero if bpf_map_area_alloc() fails so it means we return
> ERR_PTR(0) which is NULL.  The caller, find_and_alloc_map(), is not
> expecting NULL returns and will oops.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

good catch. Thanks!
Acked-by: Alexei Starovoitov <ast@kernel.org>


^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-26  1:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <20170826022744-mutt-send-email-mst@kernel.org>

On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>> >> >> > We don't enable network watchdog on virtio but we could and maybe
>> >> >> > should.
>> >> >>
>> >> >> Can you elaborate?
>> >> >
>> >> > The issue is that holding onto buffers for very long times makes guests
>> >> > think they are stuck. This is funamentally because from guest point of
>> >> > view this is a NIC, so it is supposed to transmit things out in
>> >> > a timely manner. If host backs the virtual NIC by something that is not
>> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
>> >> > guest will be confused.
>> >>
>> >> That assumes that guests are fragile in this regard. A linux guest
>> >> does not make such assumptions.
>> >
>> > Yes it does. Examples above:
>> >         > > - a single slow flow can occupy the whole ring, you will not
>> >         > >   be able to make any new buffers available for the fast flow
>>
>> Oh, right. Though those are due to vring_desc pool exhaustion
>> rather than an upper bound on latency of any single packet.
>>
>> Limiting the number of zerocopy packets in flight to some fraction
>> of the ring ensures that fast flows can always grab a slot.
>> Running
>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>> I read it correclty the zerocopy pool may be equal to or larger than
>> the descriptor pool. Should we refine the zcopy_used test
>>
>>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>
>> to also return false if the number of outstanding ubuf_info is greater
>> than, say, vq->num >> 1?
>
>
> We'll need to think about where to put the threshold, but I think it's
> a good idea.
>
> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
> resources.
>
> In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.
>
> Need to experiment with some numbers.

I can take a stab with two flows, one delayed in a deep host qdisc
queue. See how this change affects the other flow and also how
sensitive that is to the chosen threshold value.

^ permalink raw reply

* Re: Permissions for eBPF objects
From: Alexei Starovoitov @ 2017-08-26  1:03 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Chenbo Feng, Jeffrey Vander Stoep, Stephen Smalley, netdev,
	SELinux, mic
In-Reply-To: <59A0837F.9090301@iogearbox.net>

On Fri, Aug 25, 2017 at 10:07:27PM +0200, Daniel Borkmann wrote:
> On 08/25/2017 09:52 PM, Chenbo Feng wrote:
> > On Fri, Aug 25, 2017 at 12:45 PM, Jeffrey Vander Stoep <jeffv@google.com> wrote:
> > > On Fri, Aug 25, 2017 at 12:26 PM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > > On Fri, 2017-08-25 at 11:01 -0700, Jeffrey Vander Stoep via Selinux
> > > > wrote:
> > > > > I’d like to get your thoughts on adding LSM permission checks on BPF
> > > > > objects.

before reinventing the wheel please take a look at landlock work.
Everything that was discussed in this thread is covered by it.
The patches have been in development for more than a year and most of the early
issues have been resolved.
It will be presented again during security summit in LA in September.

^ permalink raw reply

* Re: [PATCH net-next] tcp: fix hang in tcp_sendpage_locked()
From: David Miller @ 2017-08-26  0:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tom, dvyukov
In-Reply-To: <1503667625.18816.9.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Aug 2017 06:27:05 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> syszkaller got a hang in tcp stack, related to a bug in
> tcp_sendpage_locked()
> 
> root@syzkaller:~# cat /proc/3059/stack
> [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0
> [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110
> [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50
> [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0
> [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110
> [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60
> [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280
> [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160
> [<ffffffff84089203>] tcp_sendpage+0x43/0x60
> [<ffffffff841641da>] inet_sendpage+0x1aa/0x660
> [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0
> [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0
> [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0
> [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750
> [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330
> [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50
> [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610
> [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>

APplied, thanks Eric.

^ permalink raw reply

* Re: [Patch net-next v2 0/4] net_sched: clean up tc classes and u32 filter
From: David Miller @ 2017-08-26  0:20 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, jhs
In-Reply-To: <20170824235130.28503-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 24 Aug 2017 16:51:26 -0700

> Patch 1 and patch 2 prepare for patch 3. Major changes
> are in patch 3 and patch 4, details are there too.
> 
> Cong Wang (4):
>   net_sched: get rid of more forward declarations
>   net_sched: introduce tclass_del_notify()
>   net_sched: remove tc class reference counting
>   net_sched: kill u32_node pointer in Qdisc
> 
> ---
> v2: Add patch 1 and 2, group all into a patchset
>     Fix a coding style issue in patch 4

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net] tcp: fix refcnt leak with ebpf congestion control
From: David Miller @ 2017-08-26  0:16 UTC (permalink / raw)
  To: sd; +Cc: netdev, brakmo, daniel
In-Reply-To: <d69d1981d2c74805affc4ed6378447ea4cb06c67.1503659184.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Fri, 25 Aug 2017 13:10:12 +0200

> There are a few bugs around refcnt handling in the new BPF congestion
> control setsockopt:
> 
>  - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
>    cannot get a reference on it. This would lead to a use after free,
>    since that ca is going away soon.
> 
>  - Changing the congestion control case doesn't release the refcnt on
>    the previous ca.
> 
>  - In the reinit case, we first leak a reference on the old ca, then we
>    call tcp_reinit_congestion_control on the ca that we have just
>    assigned, leading to deinitializing the wrong ca (->release of the
>    new ca on the old ca's data) and releasing the refcount on the ca
>    that we actually want to use.
> 
> This is visible by building (for example) BIC as a module and setting
> net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
> samples/bpf.
> 
> This patch fixes the refcount issues, and moves reinit back into tcp
> core to avoid passing a ca pointer back to BPF.
> 
> Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH] hinic: skb_pad() frees on error
From: David Miller @ 2017-08-26  0:14 UTC (permalink / raw)
  To: dan.carpenter; +Cc: aviad.krawczyk, netdev, kernel-janitors
In-Reply-To: <20170825082428.hpnbs4i74bubm4cz@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 25 Aug 2017 11:24:28 +0300

> The skb_pad() function frees the skb on error, so this code has a double
> free.
> 
> Fixes: 00e57a6d4ad3 ("net-next/hinic: Add Tx operation")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox