Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next V2 1/2] cls_flower: Fix missing free of rhashtable
From: Simon Horman @ 2018-06-04 17:51 UTC (permalink / raw)
  To: Paul Blakey
  Cc: Jiri Pirko, Cong Wang, Jamal Hadi Salim, David Miller, netdev,
	Yevgeny Kliteynik, Roi Dayan, Shahar Klein, Mark Bloch,
	Or Gerlitz
In-Reply-To: <1528009574-63306-1-git-send-email-paulb@mellanox.com>

On Sun, Jun 03, 2018 at 10:06:13AM +0300, Paul Blakey wrote:
> When destroying the instance, destroy the head rhashtable.
> 
> Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
> Reported-by: Vlad Buslov <vladbu@mellanox.com>
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>

Reviewed-by: Simon Horman <simon.horman@netronome.com>

^ permalink raw reply

* Re: [PATCH v5 2/3] media: rc: introduce BPF_PROG_LIRC_MODE2
From: Matthias Reichl @ 2018-06-04 17:47 UTC (permalink / raw)
  To: Sean Young
  Cc: linux-media, linux-kernel, Alexei Starovoitov,
	Mauro Carvalho Chehab, Daniel Borkmann, netdev, Devin Heitmueller,
	Y Song, Quentin Monnet
In-Reply-To: <9f2c54d4956f962f44fcda739a824397ddea132c.1527419762.git.sean@mess.org>

Hi Sean,

I finally found the time to test your patch series and noticed
2 issues - comments are inline

On Sun, May 27, 2018 at 12:24:09PM +0100, Sean Young wrote:
> diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
> index eb2c3b6eca7f..d5b35a6ba899 100644
> --- a/drivers/media/rc/Kconfig
> +++ b/drivers/media/rc/Kconfig
> @@ -25,6 +25,19 @@ config LIRC
>  	   passes raw IR to and from userspace, which is needed for
>  	   IR transmitting (aka "blasting") and for the lirc daemon.
>  
> +config BPF_LIRC_MODE2
> +	bool "Support for eBPF programs attached to lirc devices"
> +	depends on BPF_SYSCALL
> +	depends on RC_CORE=y

Requiring rc-core to be built into the kernel could become
problematic in the future for people using media_build.

Currently the whole media tree (including rc-core) can be built
as modules so DVB and IR drivers can be replaced by newer versions.
But with rc-core in the kernel things could easily break if internal
data structures are changed.

Maybe we should add a small layer with a stable API/ABI between
bpf-lirc and rc-core to decouple them? Or would it be possible
to build rc-core with bpf support as a module?

> +	depends on LIRC
> +	help
> +	   Allow attaching eBPF programs to a lirc device using the bpf(2)
> +	   syscall command BPF_PROG_ATTACH. This is supported for raw IR
> +	   receivers.
> +
> +	   These eBPF programs can be used to decode IR into scancodes, for
> +	   IR protocols not supported by the kernel decoders.
> +
>  menuconfig RC_DECODERS
>  	bool "Remote controller decoders"
>  	depends on RC_CORE
> [...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 388d4feda348..3c104113d040 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -11,6 +11,7 @@
>   */
>  #include <linux/bpf.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/bpf_lirc.h>
>  #include <linux/btf.h>
>  #include <linux/syscalls.h>
>  #include <linux/slab.h>
> @@ -1578,6 +1579,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  	case BPF_SK_SKB_STREAM_PARSER:
>  	case BPF_SK_SKB_STREAM_VERDICT:
>  		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, true);
> +	case BPF_LIRC_MODE2:
> +		return lirc_prog_attach(attr);
>  	default:
>  		return -EINVAL;
>  	}
> @@ -1648,6 +1651,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>  	case BPF_SK_SKB_STREAM_PARSER:
>  	case BPF_SK_SKB_STREAM_VERDICT:
>  		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, false);
> +	case BPF_LIRC_MODE2:
> +		return lirc_prog_detach(attr);
>  	default:
>  		return -EINVAL;
>  	}
> @@ -1695,6 +1700,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
>  	case BPF_CGROUP_SOCK_OPS:
>  	case BPF_CGROUP_DEVICE:
>  		break;
> +	case BPF_LIRC_MODE2:
> +		return lirc_prog_query(attr, uattr);

When testing this patch series I was wondering why I always got
-EINVAL when trying to query the registered programs.

Closer inspection revealed that bpf_prog_attach/detach/query and
calls to them in the bpf syscall are in "#ifdef CONFIG_CGROUP_BPF"
blocks - and as I built the kernel without CONFIG_CGROUP_BPF
BPF_PROG_ATTACH/DETACH/QUERY weren't handled in the syscall switch
and I got -EINVAL from the bpf syscall function.

I haven't checked in detail yet, but it looks to me like
bpf_prog_attach/detach/query could always be built (or when
either cgroup bpf or lirc bpf are enabled) and the #ifdefs moved
inside the switch(). So lirc bpf could be used without cgroup bpf.
Or am I missing something?

so long,

Hias
>  	default:
>  		return -EINVAL;
>  	}
> -- 
> 2.17.0
> 

^ permalink raw reply

* Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Eric Dumazet @ 2018-06-04 17:24 UTC (permalink / raw)
  To: Maciej Żenczykowski, Maciej Żenczykowski,
	David S . Miller
  Cc: Eric Dumazet, netdev
In-Reply-To: <20180603174705.51802-1-zenczykowski@gmail.com>



On 06/03/2018 10:47 AM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [RFC PATCH 0/6] net: ethernet: ti: cpsw: add MQPRIO and CBS Qdisc offload
From: Vinicius Costa Gomes @ 2018-06-04 17:23 UTC (permalink / raw)
  To: Ivan Khoronzhuk, grygorii.strashko, davem
  Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap, henrik,
	jesus.sanchez-palencia, Ivan Khoronzhuk
In-Reply-To: <20180518211510.13341-1-ivan.khoronzhuk@linaro.org>

Hi Ivan,

Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> writes:

> This series adds MQPRIO and CBS Qdisc offload for TI cpsw driver.
> It potentially can be used in audio video bridging (AVB) and time
> sensitive networking (TSN).
>
> Patchset was tested on AM572x EVM and BBB boards. Last patch from this
> series adds detailed description of configuration with examples. For
> consistency reasons, in role of talker and listener, tools from
> patchset "TSN: Add qdisc based config interface for CBS" were used and
> can be seen here: https://www.spinics.net/lists/netdev/msg460869.html
>
> Based on net-next/master
>

I didn't test this, but it looks fine from my side.

I agree with Grygorii, that if no comments, this should be re-sent as a
patch series next.


> Ivan Khoronzhuk (6):
>   net: ethernet: ti: cpsw: use cpdma channels in backward order for txq
>   net: ethernet: ti: cpdma: fit rated channels in backward order
>   net: ethernet: ti: cpsw: add MQPRIO Qdisc offload
>   net: ethernet: ti: cpsw: add CBS Qdisc offload
>   net: ethernet: ti: cpsw: restore shaper configuration while down/up
>   Documentation: networking: cpsw: add MQPRIO & CBS offload examples
>
>  Documentation/networking/cpsw.txt       | 540 ++++++++++++++++++++++++
>  drivers/net/ethernet/ti/cpsw.c          | 364 +++++++++++++++-
>  drivers/net/ethernet/ti/davinci_cpdma.c |  31 +-
>  3 files changed, 913 insertions(+), 22 deletions(-)
>  create mode 100644 Documentation/networking/cpsw.txt
>
> -- 
> 2.17.0


Cheers,
--
Vinicius

^ permalink raw reply

* Re: [PATCH net v3] ipv6: omit traffic class when calculating flow hash
From: David Miller @ 2018-06-04 17:22 UTC (permalink / raw)
  To: mkubecek; +Cc: netdev, linux-kernel, nicolas.dichtel, tom, dsahern, idosch
In-Reply-To: <20180604095619.7D9C1A09F0@unicorn.suse.cz>

From: Michal Kubecek <mkubecek@suse.cz>
Date: Mon, 4 Jun 2018 11:36:05 +0200

> Some of the code paths calculating flow hash for IPv6 use flowlabel member
> of struct flowi6 which, despite its name, encodes both flow label and
> traffic class. If traffic class changes within a TCP connection (as e.g.
> ssh does), ECMP route can switch between path. It's also inconsistent with
> other code paths where ip6_flowlabel() (returning only flow label) is used
> to feed the key.
> 
> Use only flow label everywhere, including one place where hash key is set
> using ip6_flowinfo().
> 
> Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
> Fixes: f70ea018da06 ("net: Add functions to get skb->hash based on flow structures")
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> ---
> v2: introduce and use an inline helper as suggested by David Ahern
> v3: keep the cast out of the helper to make future cleanup easier

Applied and queued up for -stable, thank you.

^ permalink raw reply

* Re: [PATCH] net-tcp: extend tcp_tw_reuse sysctl to enable loopback only optimization
From: Eric Dumazet @ 2018-06-04 17:18 UTC (permalink / raw)
  To: Maciej Żenczykowski, Maciej Żenczykowski,
	David S . Miller, Eric Dumazet
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Wei Wang
In-Reply-To: <20180603174117.48539-1-zenczykowski@gmail.com>



On 06/03/2018 10:41 AM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> This changes the /proc/sys/net/ipv4/tcp_tw_reuse from a boolean
> to an integer.
> 
> It now takes the values 0, 1 and 2, where 0 and 1 behave as before,
> while 2 enables timewait socket reuse only for sockets that we can
> prove are loopback connections:
>   ie. bound to 'lo' interface or where one of source or destination
>   IPs is 127.0.0.0/8, ::ffff:127.0.0.0/104 or ::1.
> 
> This enables quicker reuse of ephemeral ports for loopback connections
> - where tcp_tw_reuse is 100% safe from a protocol perspective
> (this assumes no artificially induced packet loss on 'lo').
> 
> This also makes estblishing many loopback connections *much* faster
> (allocating ports out of the first half of the ephemeral port range
> is significantly faster, then allocating from the second half)
> 
> Without this change in a 32K ephemeral port space my sample program
> (it just establishes and closes [::1]:ephemeral -> [::1]:server_port
> connections in a tight loop) fails after 32765 connections in 24 seconds.
> With it enabled 50000 connections only take 4.7 seconds.
> 
> This is particularly problematic for IPv6 where we only have one local
> address and cannot play tricks with varying source IP from 127.0.0.0/8
> pool.
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Wei Wang <weiwan@google.com>

This seems fine, thanks Maciej

Signed-off-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH net] l2tp: fix refcount leakage on PPPoL2TP sockets
From: Guillaume Nault @ 2018-06-04 16:52 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman

Commit d02ba2a6110c ("l2tp: fix race in pppol2tp_release with session
object destroy") tried to fix a race condition where a PPPoL2TP socket
would disappear while the L2TP session was still using it. However, it
missed the root issue which is that an L2TP session may accept to be
reconnected if its associated socket has entered the release process.

The tentative fix makes the session hold the socket it is connected to.
That saves the kernel from crashing, but introduces refcount leakage,
preventing the socket from completing the release process. Once stalled,
everything the socket depends on can't be released anymore, including
the L2TP session and the l2tp_ppp module.

The root issue is that, when releasing a connected PPPoL2TP socket, the
session's ->sk pointer (RCU-protected) is reset to NULL and we have to
wait for a grace period before destroying the socket. The socket drops
the session in its ->sk_destruct callback function, so the session
will exist until the last reference on the socket is dropped.
Therefore, there is a time frame where pppol2tp_connect() may accept
reconnecting a session, as it only checks ->sk to figure out if the
session is connected. This time frame is shortened by the fact that
pppol2tp_release() calls l2tp_session_delete(), making the session
unreachable before resetting ->sk. However, pppol2tp_connect() may
grab the session before it gets unhashed by l2tp_session_delete(), but
it may test ->sk after the later got reset. The race is not so hard to
trigger and syzbot found a pretty reliable reproducer:
https://syzkaller.appspot.com/bug?id=418578d2a4389074524e04d641eacb091961b2cf

Before d02ba2a6110c, another race could let pppol2tp_release()
overwrite the ->__sk pointer of an L2TP session, thus tricking
pppol2tp_put_sk() into calling sock_put() on a socket that is different
than the one for which pppol2tp_release() was originally called. To get
there, we had to trigger the race described above, therefore having one
PPPoL2TP socket being released, while the session it is connected to is
reconnecting to a different PPPoL2TP socket. When releasing this new
socket fast enough, pppol2tp_release() overwrites the session's
->__sk pointer with the address of the new socket, before the first
pppol2tp_put_sk() call gets scheduled. Then the pppol2tp_put_sk() call
invoked by the original socket will sock_put() the new socket,
potentially dropping its last reference. When the second
pppol2tp_put_sk() finally runs, its socket has already been freed.

With d02ba2a6110c, the session takes a reference on both sockets.
Furthermore, the session's ->sk pointer is reset in the
pppol2tp_session_close() callback function rather than in
pppol2tp_release(). Therefore, ->__sk can't be overwritten and
pppol2tp_put_sk() is called only once (l2tp_session_delete() will only
run pppol2tp_session_close() once, to protect the session against
concurrent deletion requests). Now pppol2tp_put_sk() will properly
sock_put() the original socket, but the new socket will remain, as
l2tp_session_delete() prevented the release process from completing.
Here, we don't depend on the ->__sk race to trigger the bug. Getting
into the pppol2tp_connect() race is enough to leak the reference, no
matter when new socket is released.

So it all boils down to pppol2tp_connect() failing to realise that the
session has already been connected. This patch drops the unneeded extra
reference counting (mostly reverting d02ba2a6110c) and checks that
neither ->sk nor ->__sk is set before allowing a session to be
connected.

Fixes: d02ba2a6110c ("l2tp: fix race in pppol2tp_release with session object destroy")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ppp.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 1fd9e145076a..466f17646625 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -428,16 +428,6 @@ static void pppol2tp_put_sk(struct rcu_head *head)
  */
 static void pppol2tp_session_close(struct l2tp_session *session)
 {
-	struct pppol2tp_session *ps;
-
-	ps = l2tp_session_priv(session);
-	mutex_lock(&ps->sk_lock);
-	ps->__sk = rcu_dereference_protected(ps->sk,
-					     lockdep_is_held(&ps->sk_lock));
-	RCU_INIT_POINTER(ps->sk, NULL);
-	if (ps->__sk)
-		call_rcu(&ps->rcu, pppol2tp_put_sk);
-	mutex_unlock(&ps->sk_lock);
 }
 
 /* Really kill the session socket. (Called from sock_put() if
@@ -480,15 +470,24 @@ static int pppol2tp_release(struct socket *sock)
 	sock_orphan(sk);
 	sock->sk = NULL;
 
-	/* If the socket is associated with a session,
-	 * l2tp_session_delete will call pppol2tp_session_close which
-	 * will drop the session's ref on the socket.
-	 */
 	session = pppol2tp_sock_to_session(sk);
 	if (session) {
+		struct pppol2tp_session *ps;
+
 		l2tp_session_delete(session);
-		/* drop the ref obtained by pppol2tp_sock_to_session */
-		sock_put(sk);
+
+		ps = l2tp_session_priv(session);
+		mutex_lock(&ps->sk_lock);
+		ps->__sk = rcu_dereference_protected(ps->sk,
+						     lockdep_is_held(&ps->sk_lock));
+		RCU_INIT_POINTER(ps->sk, NULL);
+		mutex_unlock(&ps->sk_lock);
+		call_rcu(&ps->rcu, pppol2tp_put_sk);
+
+		/* Rely on the sock_put() call at the end of the function for
+		 * dropping the reference held by pppol2tp_sock_to_session().
+		 * The last reference will be dropped by pppol2tp_put_sk().
+		 */
 	}
 
 	release_sock(sk);
@@ -742,7 +741,8 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
 		 */
 		mutex_lock(&ps->sk_lock);
 		if (rcu_dereference_protected(ps->sk,
-					      lockdep_is_held(&ps->sk_lock))) {
+					      lockdep_is_held(&ps->sk_lock)) ||
+		    ps->__sk) {
 			mutex_unlock(&ps->sk_lock);
 			error = -EEXIST;
 			goto end;
@@ -803,7 +803,6 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
 
 out_no_ppp:
 	/* This is how we get the session context from the socket. */
-	sock_hold(sk);
 	sk->sk_user_data = session;
 	rcu_assign_pointer(ps->sk, sk);
 	mutex_unlock(&ps->sk_lock);
-- 
2.17.1

^ permalink raw reply related

* [PATCH V2 net] net: hns: Fix the process of adding broadcast addresses to tcam
From: Salil Mehta @ 2018-06-04 16:50 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang

From: Xi Wang <wangxi11@huawei.com>

If the multicast mask value in device tree is configured not all
0xff, the broadcast mac will be lost from tcam table after the
execution of command 'ifconfig up'. The address is appended by
hns_ae_start, but will be clear later by hns_nic_set_rx_mode
called in dev_open process.

This patch fixed it by not use the multicast mask when add a
broadcast address.

Fixes: b5996f11ea54 ("net: add Hisilicon Network Subsystem basic ethernet support")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Patch V2: Fixed comments by David Miller
   Link: https://lkml.org/lkml/2018/6/4/671
Patch V1: Initial Submit
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 23 +++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
index e0bc79e..85e1d14 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
@@ -1648,6 +1648,15 @@ int hns_dsaf_rm_mac_addr(
 				      mac_entry->addr);
 }
 
+static void hns_dsaf_setup_mc_mask(struct dsaf_device *dsaf_dev,
+				   u8 port_num, u8 *mask, u8 *addr)
+{
+	if (MAC_IS_BROADCAST(addr))
+		memset(mask, 0xff, ETH_ALEN);
+	else
+		memcpy(mask, dsaf_dev->mac_cb[port_num]->mc_mask, ETH_ALEN);
+}
+
 static void hns_dsaf_mc_mask_bit_clear(char *dst, const char *src)
 {
 	u16 *a = (u16 *)dst;
@@ -1676,7 +1685,6 @@ int hns_dsaf_add_mac_mc_port(struct dsaf_device *dsaf_dev,
 	struct dsaf_drv_tbl_tcam_key tmp_mac_key;
 	struct dsaf_tbl_tcam_data tcam_data;
 	u8 mc_addr[ETH_ALEN];
-	u8 *mc_mask;
 	int mskid;
 
 	/*chechk mac addr */
@@ -1687,9 +1695,12 @@ int hns_dsaf_add_mac_mc_port(struct dsaf_device *dsaf_dev,
 	}
 
 	ether_addr_copy(mc_addr, mac_entry->addr);
-	mc_mask = dsaf_dev->mac_cb[mac_entry->in_port_num]->mc_mask;
 	if (!AE_IS_VER1(dsaf_dev->dsaf_ver)) {
+		u8 mc_mask[ETH_ALEN];
+
 		/* prepare for key data setting */
+		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
+				       mc_mask, mac_entry->addr);
 		hns_dsaf_mc_mask_bit_clear(mc_addr, mc_mask);
 
 		/* config key mask */
@@ -1844,7 +1855,6 @@ int hns_dsaf_del_mac_mc_port(struct dsaf_device *dsaf_dev,
 	struct dsaf_drv_tbl_tcam_key mask_key, tmp_mac_key;
 	struct dsaf_tbl_tcam_data *pmask_key = NULL;
 	u8 mc_addr[ETH_ALEN];
-	u8 *mc_mask;
 
 	if (!(void *)mac_entry) {
 		dev_err(dsaf_dev->dev,
@@ -1861,14 +1871,17 @@ int hns_dsaf_del_mac_mc_port(struct dsaf_device *dsaf_dev,
 
 	/* always mask vlan_id field */
 	ether_addr_copy(mc_addr, mac_entry->addr);
-	mc_mask = dsaf_dev->mac_cb[mac_entry->in_port_num]->mc_mask;
 
 	if (!AE_IS_VER1(dsaf_dev->dsaf_ver)) {
+		u8 mc_mask[ETH_ALEN];
+
 		/* prepare for key data setting */
+		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
+				       mc_mask, mac_entry->addr);
 		hns_dsaf_mc_mask_bit_clear(mc_addr, mc_mask);
 
 		/* config key mask */
-		hns_dsaf_set_mac_key(dsaf_dev, &mask_key, 0x00, 0xff, mc_addr);
+		hns_dsaf_set_mac_key(dsaf_dev, &mask_key, 0x00, 0xff, mc_mask);
 
 		mask_key.high.val = le32_to_cpu(mask_key.high.val);
 		mask_key.low.val = le32_to_cpu(mask_key.low.val);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH bpf-next 00/11] AF_XDP: introducing zero-copy support
From: Alexei Starovoitov @ 2018-06-04 16:38 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan,
	intel-wired-lan
In-Reply-To: <20180604120601.18123-1-bjorn.topel@gmail.com>

On Mon, Jun 04, 2018 at 02:05:50PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This patch serie introduces zerocopy (ZC) support for
> AF_XDP. Programs using AF_XDP sockets will now receive RX packets
> without any copies and can also transmit packets without incurring any
> copies. No modifications to the application are needed, but the NIC
> driver needs to be modified to support ZC. If ZC is not supported by
> the driver, the modes introduced in the AF_XDP patch will be
> used. Using ZC in our micro benchmarks results in significantly
> improved performance as can be seen in the performance section later
> in this cover letter.
> 
> Note that for an untrusted application, HW packet steering to a
> specific queue pair (the one associated with the application) is a
> requirement when using ZC, as the application would otherwise be able
> to see other user space processes' packets. If the HW cannot support
> the required packet steering you need to use the XDP_SKB mode or the
> XDP_DRV mode without ZC turned on. The XSKMAP introduced in the AF_XDP
> patch set can be used to do load balancing in that case.
> 
> For benchmarking, you can use the xdpsock application from the AF_XDP
> patch set without any modifications. Say that you would like your UDP
> traffic from port 4242 to end up in queue 16, that we will enable
> AF_XDP on. Here, we use ethtool for this:
> 
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
> 
> Running the rxdrop benchmark in XDP_DRV mode with zerocopy can then be
> done using:
> 
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are without
> retpoline so that we can compare against previous numbers. 
> 
> AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
> set are also reported for ease of reference. The numbers within
> parantheses are from the RFC V1 ZC patch set.
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.9*       9.6*       21.1(21.5)
> txpush       2.6*       -          22.0(21.6)
> l2fwd        1.9*       2.5*       15.3(15.0)
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.1*       3.3*       3.3(3.3)
> l2fwd        1.4*       1.8*       3.1(3.1)
> 
> * From AF_XDP V3 patch set and cover letter.
> 
> So why do we not get higher values for RX similar to the 34 Mpps we
> had in AF_PACKET V4? We made an experiment running the rxdrop
> benchmark without using the xdp_do_redirect/flush infrastructure nor
> using an XDP program (all traffic on a queue goes to one
> socket). Instead the driver acts directly on the AF_XDP socket. With
> this we got 36.9 Mpps, a significant improvement without any change to
> the uapi. So not forcing users to have an XDP program if they do not
> need it, might be a good idea. This measurement is actually higher
> than what we got with AF_PACKET V4.
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32.3M  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M    0
> 
> The structure of the patch set is as follows:
> 
> Patches 1-3: Plumbing for AF_XDP ZC support
> Patches 4-5: AF_XDP ZC for RX
> Patches 6-7: AF_XDP ZC for TX

Acked-by: Alexei Starovoitov <ast@kernel.org>
for above patches

> Patch 8-10: ZC support for i40e.

these also look good to me.
would be great if i40e experts take a look at them asap.

If there are no major objections we'd like to merge all of it
for this merge window.

Thanks!

^ permalink raw reply

* Re: [PATCH net-next 0/2] cls_flower: Various fixes
From: Cong Wang @ 2018-06-04 16:45 UTC (permalink / raw)
  To: Roi Dayan
  Cc: Jiri Pirko, Paul Blakey, Jiri Pirko, Jamal Hadi Salim,
	David Miller, Linux Kernel Network Developers, Yevgeny Kliteynik,
	Shahar Klein, Mark Bloch, Or Gerlitz
In-Reply-To: <695d484a-b981-054a-1da0-e63719bc9100@mellanox.com>

On Mon, Jun 4, 2018 at 12:35 AM, Roi Dayan <roid@mellanox.com> wrote:
>
>
> On 03/06/2018 22:39, Jiri Pirko wrote:
>>
>> Sun, Jun 03, 2018 at 08:33:25PM CEST, xiyou.wangcong@gmail.com wrote:
>>>
>>> On Wed, May 30, 2018 at 1:17 AM, Paul Blakey <paulb@mellanox.com> wrote:
>>>>
>>>> Two of the fixes are for my multiple mask patch
>>>>
>>>> Paul Blakey (2):
>>>>    cls_flower: Fix missing free of rhashtable
>>>>    cls_flower: Fix comparing of old filter mask with new filter
>>>
>>>
>>> Both are bug fixes and one-line fixes, so definitely should go
>>> to -net tree and -stable tree.
>>
>>
>> I agree.
>>
>
> it's because the commit they fix doesn't exists in net yet.
>

Oh, sorry, my bad, I thought Apr 30 is in a previous release...

Anyway, v2 should be applied cleanly to -net-next or -net after
net-next is merged into it.

^ permalink raw reply

* [PATCH bpf-next v2 2/2] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-04 16:33 UTC (permalink / raw)
  To: netdev
In-Reply-To: <152813003609.3465.618891361534945522.stgit@alrua-kau>

This adds an example program showing how to sample packets from XDP using
the perf event buffer. The example userspace program just prints the
ethernet header for every packet sampled.

The example sets up a perf file descriptor per CPU, allowing the XDP
program to pass BPF_F_CURRENT_CPU and work no matter which CPU handles the
packet.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 samples/bpf/Makefile               |    4 +
 samples/bpf/xdp_sample_pkts_kern.c |   62 ++++++++++++++
 samples/bpf/xdp_sample_pkts_user.c |  162 ++++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+)
 create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
 create mode 100644 samples/bpf/xdp_sample_pkts_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af10e54d..6f0c6d276a86 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += xdp_sample_pkts
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+xdp_sample_pkts-objs := bpf_load.o xdp_sample_pkts_user.o $(TRACE_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += xdp_sample_pkts_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
+HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
 
 HOST_LOADLIBES		+= $(LIBBPF) -lelf
 HOSTLOADLIBES_tracex4		+= -lrt
diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
new file mode 100644
index 000000000000..4560522ca015
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_kern.c
@@ -0,0 +1,62 @@
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#define SAMPLE_SIZE 64ul
+#define MAX_CPUS 24
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	       char ____fmt[] = fmt;				\
+	       bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				##__VA_ARGS__);			\
+})
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_CPUS,
+};
+
+SEC("xdp_sample")
+int xdp_sample_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+
+        /* Metadata will be in the perf event before the packet data. */
+	struct S {
+		u16 cookie;
+		u16 pkt_len;
+	} __attribute__((packed)) metadata;
+
+	if (data + SAMPLE_SIZE < data_end) {
+		/* The XDP perf_event_output handler will use the upper 32 bits
+		 * of the flags argument as a number of bytes to include of the
+		 * packet payload in the event data. If the size is too big, the
+		 * call to bpf_perf_event_output will fail and return -EFAULT.
+		 *
+		 * See bpf_xdp_event_output in net/core/filter.c.
+		 *
+		 * The BPF_F_CURRENT_CPU flag means that the event output fd
+		 * will be indexed by the CPU number in the event map.
+		 */
+		u64 flags = (SAMPLE_SIZE << 32) | BPF_F_CURRENT_CPU;
+		int ret;
+
+		metadata.cookie = 0xdead;
+		metadata.pkt_len = (u16)(data_end - data);
+
+		ret = bpf_perf_event_output(ctx, &my_map, flags,
+				      &metadata, sizeof(metadata));
+		if(ret)
+			bpf_printk("perf_event_output failed: %d\n", ret);
+	}
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/xdp_sample_pkts_user.c b/samples/bpf/xdp_sample_pkts_user.c
new file mode 100644
index 000000000000..35c5dd953f48
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_user.c
@@ -0,0 +1,162 @@
+/* This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <net/if.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/sysinfo.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include <libbpf.h>
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+
+#include "perf-sys.h"
+#include "trace_helpers.h"
+
+#define MAX_CPUS 24
+static int pmu_fds[MAX_CPUS], if_idx = 0;
+static struct perf_event_mmap_page *headers[MAX_CPUS];
+static char *if_name;
+
+static int do_attach(int idx, int fd, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, fd, 0);
+	if (err < 0)
+		printf("ERROR: failed to attach program to %s\n", name);
+
+	return err;
+}
+
+static int do_detach(int idx, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, -1, 0);
+	if (err < 0)
+		printf("ERROR: failed to detach program from %s\n", name);
+
+	return err;
+}
+
+#define SAMPLE_SIZE 64
+
+static int print_bpf_output(void *data, int size)
+{
+	struct {
+		__u16 cookie;
+		__u16 pkt_len;
+		__u8  pkt_data[SAMPLE_SIZE];
+	} __attribute__((packed)) *e = data;
+	int i;
+
+	if (e->cookie != 0xdead) {
+		printf("BUG cookie %x sized %d\n",
+		       e->cookie, size);
+		return LIBBPF_PERF_EVENT_ERROR;
+	}
+
+	printf("Pkt len: %-5d bytes. Ethernet hdr: ", e->pkt_len);
+	for (i = 0; i < 14 && i < e->pkt_len; i++)
+		printf("%02x ", e->pkt_data[i]);
+	printf("\n");
+
+	return LIBBPF_PERF_EVENT_CONT;
+}
+
+static void test_bpf_perf_event(int num)
+{
+	struct perf_event_attr attr = {
+		.sample_type = PERF_SAMPLE_RAW,
+		.type = PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.wakeup_events = 1, /* get an fd notification for every event */
+	};
+	int i;
+
+	for (i = 0; i < num; i++) {
+		int key = i;
+
+		pmu_fds[i] = sys_perf_event_open(&attr, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
+
+		assert(pmu_fds[i] >= 0);
+		assert(bpf_map_update_elem(map_fd[0], &key, &pmu_fds[i], BPF_ANY) == 0);
+		ioctl(pmu_fds[i], PERF_EVENT_IOC_ENABLE, 0);
+	}
+}
+
+static void sig_handler(int signo)
+{
+	do_detach(if_idx, if_name);
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	char filename[256];
+	int ret, err;
+	int numcpus;
+	int i;
+
+	if (argc < 2) {
+		printf("Usage: %s <ifname>\n", argv[0]);
+		return 1;
+	}
+
+	numcpus = get_nprocs();
+	if (numcpus > MAX_CPUS)
+		numcpus = MAX_CPUS;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if_idx = if_nametoindex(argv[1]);
+	if (!if_idx)
+		if_idx = strtoul(argv[1], NULL, 0);
+
+	if (!if_idx) {
+		fprintf(stderr, "Invalid ifname\n");
+		return 1;
+	}
+	if_name = argv[1];
+	err = do_attach(if_idx, prog_fd[0], argv[1]);
+	if (err)
+		return err;
+
+	if (signal(SIGINT, sig_handler) ||
+	    signal(SIGHUP, sig_handler) ||
+	    signal(SIGTERM, sig_handler)) {
+		perror("signal");
+		return 1;
+	}
+
+	test_bpf_perf_event(numcpus);
+
+	for (i = 0; i < numcpus; i++)
+		if (perf_event_mmap_header(pmu_fds[i], &headers[i]) < 0)
+			return 1;
+
+	ret = perf_event_poller_multi(pmu_fds, headers, numcpus, print_bpf_output);
+	kill(0, SIGINT);
+	return ret;
+}

^ permalink raw reply related

* [PATCH bpf-next v2 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Toke Høiland-Jørgensen @ 2018-06-04 16:33 UTC (permalink / raw)
  To: netdev

This adds two new helper functions to trace_helpers that supports polling
multiple perf file descriptors for events. These are used to the XDP
perf_event_output example, which needs to work with one perf fd per CPU.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 tools/testing/selftests/bpf/trace_helpers.c |   47 ++++++++++++++++++++++++++-
 tools/testing/selftests/bpf/trace_helpers.h |    4 ++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 3868dcb63420..1e62d89f34cf 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -88,7 +88,7 @@ static int page_size;
 static int page_cnt = 8;
 static struct perf_event_mmap_page *header;
 
-int perf_event_mmap(int fd)
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
 {
 	void *base;
 	int mmap_size;
@@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
 		return -1;
 	}
 
-	header = base;
+	*header = base;
 	return 0;
 }
 
+int perf_event_mmap(int fd)
+{
+	return perf_event_mmap_header(fd, &header);
+}
+
 static int perf_event_poll(int fd)
 {
 	struct pollfd pfd = { .fd = fd, .events = POLLIN };
@@ -163,3 +168,41 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
 
 	return ret;
 }
+
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn)
+{
+	enum bpf_perf_event_ret ret;
+	struct pollfd *pfds;
+	void *buf = NULL;
+	size_t len = 0;
+	int i;
+
+	pfds = malloc(sizeof(*pfds) * num_fds);
+	if (!pfds)
+		return -1;
+
+	memset(pfds, 0, sizeof(*pfds) * num_fds);
+	for (i = 0; i < num_fds; i++) {
+		pfds[i].fd = fds[i];
+		pfds[i].events = POLLIN;
+	}
+
+	for (;;) {
+		poll(pfds, num_fds, 1000);
+		for (i = 0; i < num_fds; i++) {
+			if (pfds[i].revents) {
+				ret = bpf_perf_event_read_simple(headers[i], page_cnt * page_size,
+								page_size, &buf, &len,
+								bpf_perf_event_print,
+								output_fn);
+				if (ret != LIBBPF_PERF_EVENT_CONT)
+					break;
+			}
+		}
+	}
+	free(buf);
+	free(pfds);
+
+	return ret;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index 3b4bcf7f5084..18924f23db1b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -3,6 +3,7 @@
 #define __TRACE_HELPER_H
 
 #include <libbpf.h>
+#include <linux/perf_event.h>
 
 struct ksym {
 	long addr;
@@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
 typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
 
 int perf_event_mmap(int fd);
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
 /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
 int perf_event_poller(int fd, perf_event_print_fn output_fn);
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn);
 #endif

^ permalink raw reply related

* Re: [PATCH bpf-next 0/5] AF_XDP: bug fixes and descriptor changes
From: Alexei Starovoitov @ 2018-06-04 16:24 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, mykyta.iziumtsev,
	Björn Töpel, john.fastabend, willemdebruijn.kernel, mst,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	francois.ozog, ilias.apalodimas, brian.brooks, andy, michael.chan
In-Reply-To: <20180604115715.17895-1-bjorn.topel@gmail.com>

On Mon, Jun 04, 2018 at 01:57:10PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> An issue with the current AF_XDP uapi raised by Mykyta Iziumtsev (see
> https://www.spinics.net/lists/netdev/msg503664.html) is that it does
> not support NICs that have a "type-writer" model in an efficient
> way. In this model, a memory window is passed to the hardware and
> multiple frames might be filled into that window, instead of just one
> that we have in the current fixed frame-size model.
> 
> This patch set fixes two bugs in the current implementation and then
> changes the uapi so that the type-writer model can be supported
> efficiently by a possible future extension of AF_XDP.
> 
> These are the uapi changes in this patch:
> 
> * Change the "u32 idx" in the descriptors to "u64 addr". The current
>   idx based format does NOT work for the type-writer model (as packets
>   can start anywhere within a frame) but that a relative address
>   pointer (the u64 addr) works well for both models in the prototype
>   code we have that supports both models. We increased it from u32 to
>   u64 to support umems larger than 4G. We have also removed the u16
>   offset when having a "u64 addr" since that information is already
>   carried in the least significant bits of the address.
> 
> * We want to use "u8 padding[5]" for something useful in the future
>   (since we are not allowed to change its name), so we now call it
>   just options so it can be extended for various purposes in the
>   future. It is an u32 as that it what is left of the 16 byte
>   descriptor.
> 
> * We changed the name of frame_size in the UMEM_REG setsockopt to
>   chunk_size since this naming also makes sense to the type-writer
>   model.
> 
> With these changes to the uapi, we believe the type-writer model can
> be supported without having to resort to a new descriptor format. The
> type-writer model could then be supported, from the uapi point of
> view, by setting a flag at bind time and providing a new flag bit in
> the options field of the descriptor that signals to user space that
> all packets have been written in a chunk. Or with a new chunk
> completion queue as suggested by Mykyta in his latest feedback mail on
> the list.

for the set:
Acked-by: Alexei Starovoitov <ast@kernel.org>
Thank you for these fixes.
According to unofficial feedback from brcm and netronome folks
the descriptor format should work for these nics too.
At some point we may consider second format, but I think SW
should drive HW requirements and not the other way around.

^ permalink raw reply

* Re: [PATCH net] VSOCK: check sk state before receive
From: Jorgen S. Hansen @ 2018-06-04 16:02 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Hangbin Liu, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <20180530091727.GF14623@stefanha-x1.localdomain>


> On May 30, 2018, at 11:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Sun, May 27, 2018 at 11:29:45PM +0800, Hangbin Liu wrote:
>> Hmm...Although I won't reproduce this bug with my reproducer after
>> apply my patch. I could still get a similiar issue with syzkaller sock vnet test.
>> 
>> It looks this patch is not complete. Here is the KASAN call trace with my patch.
>> I can also reproduce it without my patch.
> 
> Seems like a race between vmci_datagram_destroy_handle() and the
> delayed callback, vmci_transport_recv_dgram_cb().
> 
> I don't know the VMCI transport well so I'll leave this to Jorgen.

Yes, it looks like we are calling the delayed callback after we return from vmci_datagram_destroy_handle(). I’ll take a closer look at the VMCI side here - the refcounting of VMCI datagram endpoints should guard against this, since the delayed callback does a get on the datagram resource, so this could a VMCI driver issue, and not a problem in the VMCI transport for AF_VSOCK.

> 
>> ==================================================================
>> BUG: KASAN: use-after-free in vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>> Read of size 4 at addr ffff880026a3a914 by task kworker/0:2/96
>> 
>> CPU: 0 PID: 96 Comm: kworker/0:2 Not tainted 4.17.0-rc6.vsock+ #28
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> Workqueue: events dg_delayed_dispatch [vmw_vmci]
>> Call Trace:
>> __dump_stack lib/dump_stack.c:77 [inline]
>> dump_stack+0xdd/0x18e lib/dump_stack.c:113
>> print_address_description+0x7a/0x3e0 mm/kasan/report.c:256
>> kasan_report_error mm/kasan/report.c:354 [inline]
>> kasan_report+0x1dd/0x460 mm/kasan/report.c:412
>> vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>> vmci_transport_recv_dgram_cb+0x5d/0x200 [vmw_vsock_vmci_transport]
>> dg_delayed_dispatch+0x99/0x1b0 [vmw_vmci]
>> process_one_work+0xa4e/0x1720 kernel/workqueue.c:2145
>> worker_thread+0x1df/0x1400 kernel/workqueue.c:2279
>> kthread+0x343/0x4b0 kernel/kthread.c:240
>> ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:412
>> 
>> Allocated by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
>> slab_post_alloc_hook mm/slab.h:444 [inline]
>> slab_alloc_node mm/slub.c:2741 [inline]
>> slab_alloc mm/slub.c:2749 [inline]
>> kmem_cache_alloc+0x105/0x330 mm/slub.c:2754
>> sk_prot_alloc+0x6a/0x2c0 net/core/sock.c:1468
>> sk_alloc+0xc9/0xbb0 net/core/sock.c:1528
>> __vsock_create+0xc8/0x9b0 [vsock]
>> vsock_create+0xfd/0x1a0 [vsock]
>> __sock_create+0x310/0x690 net/socket.c:1285
>> sock_create net/socket.c:1325 [inline]
>> __sys_socket+0x101/0x240 net/socket.c:1355
>> __do_sys_socket net/socket.c:1364 [inline]
>> __se_sys_socket net/socket.c:1362 [inline]
>> __x64_sys_socket+0x7d/0xd0 net/socket.c:1362
>> do_syscall_64+0x175/0x630 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> Freed by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> __kasan_slab_free+0x130/0x180 mm/kasan/kasan.c:521
>> slab_free_hook mm/slub.c:1388 [inline]
>> slab_free_freelist_hook mm/slub.c:1415 [inline]
>> slab_free mm/slub.c:2988 [inline]
>> kmem_cache_free+0xce/0x410 mm/slub.c:3004
>> sk_prot_free net/core/sock.c:1509 [inline]
>> __sk_destruct+0x629/0x940 net/core/sock.c:1593
>> sk_destruct+0x4e/0x90 net/core/sock.c:1601
>> __sk_free+0xd3/0x320 net/core/sock.c:1612
>> sk_free+0x2a/0x30 net/core/sock.c:1623
>> __vsock_release+0x431/0x610 [vsock]
>> vsock_release+0x3c/0xc0 [vsock]
>> sock_release+0x91/0x200 net/socket.c:594
>> sock_close+0x17/0x20 net/socket.c:1149
>> __fput+0x368/0xa20 fs/file_table.c:209
>> task_work_run+0x1c5/0x2a0 kernel/task_work.c:113
>> exit_task_work include/linux/task_work.h:22 [inline]
>> do_exit+0x1876/0x26c0 kernel/exit.c:865
>> do_group_exit+0x159/0x3e0 kernel/exit.c:968
>> get_signal+0x65a/0x1780 kernel/signal.c:2482
>> do_signal+0xa4/0x1fe0 arch/x86/kernel/signal.c:810
>> exit_to_usermode_loop+0x1b8/0x260 arch/x86/entry/common.c:162
>> prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>> syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>> do_syscall_64+0x505/0x630 arch/x86/entry/common.c:290
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> The buggy address belongs to the object at ffff880026a3a600
>> which belongs to the cache AF_VSOCK of size 1056
>> The buggy address is located 788 bytes inside of
>> 1056-byte region [ffff880026a3a600, ffff880026a3aa20)
>> The buggy address belongs to the page:
>> page:ffffea00009a8e00 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
>> flags: 0xfffffc0008100(slab|head)
>> raw: 000fffffc0008100 0000000000000000 0000000000000000 00000001000d000d
>> raw: dead000000000100 dead000000000200 ffff880034471a40 0000000000000000
>> page dumped because: kasan: bad access detected
>> 
>> Memory state around the buggy address:
>> ffff880026a3a800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff880026a3a880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>> ffff880026a3a900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>                         ^
>> ffff880026a3a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff880026a3aa00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>> ==================================================================


^ permalink raw reply

* [PATCH bpf-next] bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPS
From: Yonghong Song @ 2018-06-04 15:53 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Commit bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id()
helper") introduced a new helper bpf_get_current_cgroup_id().
The helper has a dependency on CONFIG_CGROUPS.

When CONFIG_CGROUPS is not defined, using the helper will result
the following verifier error:
  kernel subsystem misconfigured func bpf_get_current_cgroup_id#80
which is hard for users to interpret.
Guarding the reference to bpf_get_current_cgroup_id_proto with
CONFIG_CGROUPS will result in below better message:
  unknown func bpf_get_current_cgroup_id#80

Fixes: bf6fa2c893c5 ("bpf: implement bpf_get_current_cgroup_id() helper")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/trace/bpf_trace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e2ab5b7..0ae6829 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -564,8 +564,10 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_str_proto;
+#ifdef CONFIG_CGROUPS
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
+#endif
 	default:
 		return NULL;
 	}
-- 
2.9.5

^ permalink raw reply related

* Re: [RFC feedback] AF_XDP and non-Intel hardware
From: Björn Töpel @ 2018-06-04 15:43 UTC (permalink / raw)
  To: MykytaI Iziumtsev
  Cc: Netdev, Björn Töpel, Karlsson, Magnus, Zhang, Qi Z,
	Francois Ozog, Ilias Apalodimas, Brian Brooks,
	Jesper Dangaard Brouer, Andy Gospodarek, michael.chan,
	Luke Gorrie
In-Reply-To: <CAPvmOuEeag+R9sa_yGRy2ZVpmcofoa7Ey6y1To3qi=QV4BV6ww@mail.gmail.com>

Den mån 4 juni 2018 kl 11:44 skrev Mykyta Iziumtsev
<mykyta.iziumtsev@linaro.org>:
>
> Hi Björn and Magnus,
>
> Here is a short update on the proposed changes to AF_XDP.
>
> During implementation phase I figured out that 'end-of-page' in RX
> descriptor's flags won't work, unfortunately. This is because the
> kernel driver can't know if the frame will be last on the page when RX
> descriptor is being filled in. Imagine that the driver is processing a
> frame on a page, and there is still 1000 bytes of non utilized memory
> beyond this frame. If the next frame (which haven't yet arrived) is
> less than that -- the NIC can place it on the same page, otherwise the
> NIC will skip remaining 1000 bytes and take next page. Of course, the
> driver can update RX descriptor retrospectively, or use 'empty' RX
> descriptors just to indicate page completion, but it's too complex or
> inefficient and as such doesn't look good.
>

I think both "update RX descriptor retrospectively" and "zero-length
end-of-chunk descriptor" are OK (as long as the zero-length descs much
less common that the non-zero one). Out of curiosity, does Chelsio (or
any other type-writer based) NIC has a public data sheet? It would be
interesting to see how this is solved in practice.

> What I could propose instead is to make the 'filling' and 'page
> completion' symmetric. That is, UMEM will have 1 queue to pass pages
> from application to the kernel driver (fring in your terminology) and
> one to indicate complete pages back to the application.
>
> That 'page completion' queue can be added as a third queue to UMEM, or
> alternatively TX completions can be decoupled from UMEM to make it an
> simple 'memory area' object with two directional queues which work
> with RX pages only. TX completions can be handled in a simpler manner
> as part of TX queues: just make sure consumer index is advanced only
> when the frame is finally sent out. The TX completion processing can
> be then done by observing consume index advance from application. That
> would  not only be more in line with best NIC ring design practices
> but will improve cache locality and eliminate critical section  if TX
> queues are assigned to different CPUs.
>
...and this is also a valid option. Adding an additional umem ring for
end-of-chunk completions, might make sense, but let's the zero-copy
implementations drive that requirement.

As you probably noted Magnus and I just sent out a patch set with an
updated descriptor layout. A next step would be adding support for
type-writer styled NICs.

Thanks for the information, and hopefully we'll see an additional
AF_XDP zero-copy implementation on list (nudge, nudge). ;-)

Björn

> With best regards,
> Mykyta
>
> On 22 May 2018 at 14:09, Björn Töpel <bjorn.topel@gmail.com> wrote:
> > 2018-05-22 9:45 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >> On 21 May 2018 at 20:55, Björn Töpel <bjorn.topel@gmail.com> wrote:
> >>> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@linaro.org>:
> >>>> Hi Björn and Magnus,
> >>>>
> >>>> (This thread is a follow up to private dialogue. The intention is to
> >>>> let community know that AF_XDP can be enhanced further to make it
> >>>> compatible with wider range of NIC vendors).
> >>>>
> >>>
> >>> Mykyta, thanks for doing the write-up and sending it to the netdev
> >>> list! The timing could not be better -- we need to settle on an uapi
> >>> that works for all vendors prior enabling it in the kernel.
> >>>
> >>>> There are two NIC variations which don't fit well with current AF_XDP proposal.
> >>>>
> >>>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
> >>>> expects NIC to put incoming frames into slots defined by UMEM area.
> >>>> Here slot size is set in XDP_UMEM_REG xdp_umem.reg.frame_size and
> >>>> slots available to NIC are communicated to the kernel via UMEM fill
> >>>> queue. While Intel NICs support only one slot size, NXP and Cavium
> >>>> support multiple slot sizes to optimize memory usage (e.g. { 128, 512,
> >>>> 2048 }, please refer to [1] for rationale). On frame reception the NIC
> >>>> pulls a slot from best fit pool based on frame size.
> >>>>
> >>>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
> >>>> NICs. As shown above, AF_XDP expects NIC to put incoming frames at
> >>>> predefined addresses. This is not the case here as the NIC is in
> >>>> charge of placing frames in memory based on it's own algorithm. For
> >>>> example, Chelsio T5/T6 is expecting to get whole pages from the driver
> >>>> and puts incoming frames on the page in a 'tape recorder' fashion.
> >>>> Assuming 'base' is page address and flen[N] is an array of frame
> >>>> lengths, the frame placement in memory will look like that:
> >>>>   base + 0
> >>>>   base + frame_len[0]
> >>>>   base + frame_len[0] + frame_len[1]
> >>>>   base + frame_len[0] + frame_len[1] + frame_len[2]
> >>>>   ...
> >>>>
> >>>> To better support these two NIC variations I suggest to abandon 'frame
> >>>> size' structuring in UMEM and stick with 'pages' instead. The NIC
> >>>> kernel driver is then responsible for splitting provided pages into
> >>>> slots expected by underlying HW (or not splitting at all in case of
> >>>> Chelsio/Netcope).
> >>>>
> >>>
> >>> Let's call the first variation "multi-pool" and the second
> >>> "type-writer" for simplicity. The type-writer model is very
> >>> interesting, and makes a lot of sense when the PCIe bus is a
> >>> constraint.
> >>>
> >>>> On XDP_UMEM_REG the application needs to specify page_size. Then the
> >>>> application can pass empty pages to the kernel driver using UMEM
> >>>> 'fill' queue by specifying page offset within the UMEM area. xdp_desc
> >>>> format needs to be changed as well: frame location will be defined by
> >>>> offset from the beginning of UMEM area instead of frame index. As
> >>>> payload headroom can vary with AF_XDP we'll need to specify it in
> >>>> xdp_desc as well. Beside that it could be worth to consider changing
> >>>> payload length to u16 as 64k+ frames aren't very common in networking.
> >>>> The resulting xdp_desc would look like that:
> >>>>
> >>>> struct xdp_desc {
> >>>>         __u64 offset;
> >>>>         __u16 headroom;
> >>>>         __u16 len;
> >>>>         __u8 flags;
> >>>>         __u8 padding[3];
> >>>> };
> >>>>
> >>>
> >>> Let's expand a bit here; Instead of passing indicies to fixed sized
> >>> frames to the fill ring, we now pass a memory region. For simplicity,
> >>> let's say a page. The fill ring descriptor requires offset and
> >>> len. The offset is a global offset from an UMEM perspective, and len
> >>> is the size of the region.
> >>>
> >>
> >> I would rather stick with region equal to page (regular or huge page,
> >> defined by application). The page size can be extracted from
> >> vm_area_struct in XDP_UMEM_REG (preferred) or configured by
> >> application.
> >>
> >
> > Ok, thinking more about it I prefer this as well. This means that we
> > only need to grow the UMEM fring/cring descriptors to u64, and not
> > care about length. As you state below, this makes the validation
> > simple.
> >
> > We might consider exposing a "page size hint", that the user can set.
> > For 4G hugepage scenario, it might make sense to have a chunk *less*
> > than 4G to avoid HW Rx memory running low when the end of a chunk is
> > approaching.
> >
> > As for THP, I need to think about proper behavior here.
> >
> >>> On the Rx ring the descriptor, as you wrote, must be changed as well
> >>> to your suggestion above. Note, that headroom is still needed, since
> >>> XDP can change the size of a packet, so the fixed headroom stated in
> >>> UMEM registration is not sufficient.
> >>>
> >>> This model is obviously more flexible, but then also a bit more
> >>> complex. E.g. a fixed-frame NIC (like ixgbe), what should the
> >>> behaviour be? Should the fill ring entry be used only for *one* frame,
> >>> or chopped up to multiple receive frames? Should it be configurable?
> >>> Driver specific?
> >>
> >> I think driver-specific makes most sense here. In case of fixed-frame
> >> NIC the driver shall chop the ring entry into multiple receive frames.
> >>
> >
> > Let's start there, keeping the configuration space small.
> >
> >>>
> >>> Also, validating the entries in the fill queue require more work
> >>> (compared to the current index variant). Currently, we only skip
> >>> invalid indicies. What should we do when say, you pass a memory window
> >>> that is too narrow (say 128B) but correct from a UMEM
> >>> perspective. Going this path, we need to add pretty hard constraints
> >>> so we don't end up it too complex code -- because then it'll be too
> >>> slow.
> >>
> >> If we stick with pages -- the only possible erroneous input will be
> >> 'page out of UMEM boundaries'. The validation will be essentially:
> >>
> >> if ((offset > umem->size) || (offset & (umem->page_size - 1))
> >>     fail
> >>
> >> The question is what shall be done if validation fails ? Would
> >> SEGFAULT be reasonable ? This is more or less equivalent to
> >> dereferencing invalid pointer.
> >>
> >
> > The current scheme is simply dropping that kernel skips the invalid
> > fill ring entry. SIGSEGV is an interesting idea!
> >
> >>>
> >>>> In current proposal you have a notion of 'frame ownership': 'owned by
> >>>> kernel' or 'owned by application'. The ownership is transferred by
> >>>> means of enqueueing frame index in UMEM 'fill' queue (from application
> >>>> to kernel) or in UMEM 'tx completion' queue (from kernel to
> >>>> application). If you decide to adopt 'page' approach this notion needs
> >>>> to be changed a bit. This is because in case of packet forwarding one
> >>>> and the same page can be used for RX (parts of it enqueued in HW 'free
> >>>> lists') and TX (forwarding of previously RXed packets).
> >>>>
> >>>> I propose to define 'ownership' as a right to manipulate the
> >>>> partitioning of the page into frames. Whenever application passes a
> >>>> page to the kernel via UMEM 'fill' queue -- the ownership is
> >>>> transferred to the kernel. The application can't allocate packets on
> >>>> this page until kernel is done with it, but it can change payload of
> >>>> RXed packets before forwarding them. The kernel can pass ownership
> >>>> back by means of 'end-of-page' in xdp_desc.flags.
> >>>>
> >>>
> >>> I like the end-of-page mechanism.
> >>>
> >>>> The pages are definitely supposed to be recycled sooner or later. Even
> >>>> if it's not part of kernel API and the housekeeping implementation
> >>>> resided completely in application I still would like to propose
> >>>> possible (hopefully, cost efficient) solution to that. The recycling
> >>>> could be achieved by keeping refcount on pages and recycling the page
> >>>> only when it's owned by application and refcount reaches 0.
> >>>>
> >>>> Whenever application transfers page ownership to the kernel the
> >>>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
> >>>> corresponding page needs to be identified (xdp_desc.offset >>
> >>>> PAGE_SHIFT) and refcount incremented. When the packet gets freed the
> >>>> refcount shall be decremented. If packet is forwarded in TX xdp_desc
> >>>> -- the refcount gets decremented only on TX completion (again,
> >>>> tx_completion.desc >> PAGE_SHIFT). For packets originating from the
> >>>> application itself the payload buffers needs to be allocated from
> >>>> empty page owned by the application and refcount needs to be
> >>>> incremented as well.
> >>>>
> >>>
> >>> Strictly speaking, we're not enforcing correctness in the current
> >>> solution. If the userspace application passes index 1 mulitple times
> >>> to the fill ring, and at the same time send index 1, things will
> >>> break. So, with the existing solution the userspace application
> >>> *still* needs to track the frames. With this new model, the
> >>> tracking/refcounting will be more complex. That might be a concern.
> >>>
> >>> For the multi-pool NICs I think we can still just have one UMEM, and
> >>> let the driver decide where in which pool to place this certain chunk
> >>> of memory. Thoughts?
> >>
> >> Definitely agree with that. This is HW specific and exposing it to the
> >> application would only harm portability.
> >>
> >
> > Good stuff, we're on the same page then.
> >
> >>>
> >>> Now, how do we go forward? I think this is very important, and I will
> >>> hack a copy-mode version for this new API. I'm a bit worried that the
> >>> complexity/configuration space will grow and impact performance, but
> >>> let's see.
> >>>
> >>> To prove that the API works for the NICs you mentioned, we need an
> >>> actual zero-copy implementation for them. Do you think Linaro could
> >>> work on a zero-copy variant for any of the NICs above?
> >>>
> >>
> >> Linaro will definitely contribute zero-copy implementation for some
> >> ARM-based NICs with 'multi-pool' variation.
> >
> > Very nice!
> >
> >> Implementation of
> >> 'type-writer' variation is up to Chelsio/Netcope, we only try to come
> >> up with API which (most probably) will fit them as well.
> >>
> >
> > Let's hope we get an implementation from these vendors as well! :-)
> >
> >
> > Björn
> >
> >>>
> >>> Again thanks for bringing this up!
> >>> Björn
> >>>
> >>>
> >>>
> >>>> [1] https://en.wikipedia.org/wiki/Internet_Mix
> >>>>
> >>>> With best regards,
> >>>> Mykyta

^ permalink raw reply

* RE: [PATCH net] net: hns: Fix the process of adding broadcast addresses to tcam
From: Salil Mehta @ 2018-06-04 15:39 UTC (permalink / raw)
  To: David Miller
  Cc: Zhuangyuzeng (Yisen), lipeng (Y), mehta.salil@opnsrc.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
	wangxi (M)
In-Reply-To: <20180604.113552.1136710379930167373.davem@davemloft.net>

Hi Dave,
Thanks for pointing out. Will fix them.

Best regards
Salil

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Monday, June 04, 2018 4:36 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; lipeng (Y)
> <lipeng321@huawei.com>; mehta.salil@opnsrc.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; wangxi
> (M) <wangxi11@huawei.com>
> Subject: Re: [PATCH net] net: hns: Fix the process of adding broadcast
> addresses to tcam
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Fri, 1 Jun 2018 18:00:11 +0100
> 
> > +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> > +			mc_mask, mac_entry->addr);
> 
> Please indent things properly when arguments span multiple lines.
> 
> The second line here should be precisely after the opennning
> parenthesis
> of the previous line.
> 
> > +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> > +			mc_mask, mac_entry->addr);
> 
> Likewise.

^ permalink raw reply

* Re: [RFC PATCH net-next] net: mvpp2: mvpp2_percpu_read_relaxed() can be static
From: David Miller @ 2018-06-04 15:36 UTC (permalink / raw)
  To: fengguang.wu
  Cc: maxime.chevallier, kbuild-all, netdev, antoine.tenart,
	thomas.petazzoni, ymarkman, stefanc, linux-kernel
In-Reply-To: <20180601194613.GA44770@xian>

From: kbuild test robot <fengguang.wu@intel.com>
Date: Sat, 2 Jun 2018 03:46:13 +0800

> 
> Fixes: db9d7d36eecc ("net: mvpp2: Split the PPv2 driver to a dedicated directory")
> Signed-off-by: kbuild test robot <fengguang.wu@intel.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH net] net: hns: Fix the process of adding broadcast addresses to tcam
From: David Miller @ 2018-06-04 15:35 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, mehta.salil, netdev, linux-kernel,
	linuxarm, wangxi11
In-Reply-To: <20180601170011.22700-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Fri, 1 Jun 2018 18:00:11 +0100

> +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> +			mc_mask, mac_entry->addr);

Please indent things properly when arguments span multiple lines.

The second line here should be precisely after the opennning parenthesis
of the previous line.

> +		hns_dsaf_setup_mc_mask(dsaf_dev, mac_entry->in_port_num,
> +			mc_mask, mac_entry->addr);

Likewise.

^ permalink raw reply

* [PATCH net-next] net: sched: return error code when tcf proto is not found
From: Vlad Buslov @ 2018-06-04 15:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, jhs, xiyou.wangcong, jiri, dan.carpenter, Vlad Buslov

If requested tcf proto is not found, get and del filter netlink protocol
handlers output error message to extack, but do not return actual error
code. Add check to return ENOENT when result of tp find function is NULL
pointer.

Fixes: c431f89b18a2 ("net: sched: split tc_ctl_tfilter into three
handlers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

---
 net/sched/cls_api.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index c06585fb2dc6..cdc3c87c53e6 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1274,7 +1274,7 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
-		err = PTR_ERR(tp);
+		err = tp ? PTR_ERR(tp) : -ENOENT;
 		goto errout;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
@@ -1374,7 +1374,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
-		err = PTR_ERR(tp);
+		err = tp ? PTR_ERR(tp) : -ENOENT;
 		goto errout;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
-- 
2.7.5

^ permalink raw reply related

* Re: [PATCH] dt-bindings: net: ravb: Add support for r8a77990 SoC
From: Rob Herring @ 2018-06-04 15:32 UTC (permalink / raw)
  To: Sergei Shtylyov, Shimoda, Yoshihiro
  Cc: Simon Horman, David Miller, netdev,
	open list:MEDIA DRIVERS FOR RENESAS - FCP, Mark Rutland,
	devicetree
In-Reply-To: <19e43b97-9235-a752-7cf9-56a5e3bb281a@cogentembedded.com>

On Tue, May 15, 2018 at 3:43 AM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> On 5/13/2018 10:58 AM, Simon Horman wrote:
>
>>>> Add documentation for r8a77990 compatible string to renesas ravb device
>>>> tree bindings documentation.
>>>>
>>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
>>>
>>>
>>> I'm assuming this isn't targetted at one of my trees.  Just FYI.
>>
>>
>> Hi Dave,
>>
>> I think this is appropriate for net-next but if not I can take it.
>
>
>    There's also Rob, and IIRC he has taken alike patch recently...

Applied.

Rob

^ permalink raw reply

* Re: [PATCH net] net/packet: refine check for priv area size
From: David Miller @ 2018-06-04 15:32 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180601162302.32717-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Fri,  1 Jun 2018 09:23:02 -0700

> syzbot was able to trick af_packet again [1]

:-(

> Various commits tried to address the problem in the past,
> but failed to take into account V3 header size.
> 
> [1]
 ...
> Fixes: 2b6867c2ce76 ("net/packet: fix overflow in check for priv area size")
> Fixes: dc808110bb62 ("packet: handle too big packets for PACKET_V3")
> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: [PATCH] net: aquantia: make function aq_fw2x_get_mac_permanent static
From: David Miller @ 2018-06-04 15:31 UTC (permalink / raw)
  To: colin.king; +Cc: igor.russkikh, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180601152834.17958-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Fri,  1 Jun 2018 16:28:34 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The function aq_fw2x_get_mac_permanent is local to the source and does
> not need to be in global scope, so make it static.
> 
> Cleans up sparse warning:
> warning: symbol 'aq_fw2x_get_mac_permanent' was not declared. Should it
> be static?
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH net-next v5 00/11] Modify action API for implementing lockless actions
From: David Miller @ 2018-06-04 15:28 UTC (permalink / raw)
  To: vladbu
  Cc: netdev, jhs, xiyou.wangcong, jiri, pablo, kadlec, fw, ast, daniel,
	edumazet, keescook, marcelo.leitner, kliteyn
In-Reply-To: <1527866118-21312-1-git-send-email-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Fri,  1 Jun 2018 18:15:07 +0300

> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a first step to remove
> rtnl lock dependency from TC rules update path.

Hello Vlad.

I've read over this patch series a few times, and even though I can't
find any issues yet, I think this series is complex and invasive
enough that it should receive more time for review by other experts.

Therefore I'm deferring this to the next merge window in order to
accomodate that.

Thank you.

^ permalink raw reply

* [RFC PATCH 4/4] net: Add support for subordinate traffic classes to netdev_pick_tx
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution at is currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map as just a
way to skip the lookup of the lower device XPS map for now as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   15 ++-----
 drivers/net/macvlan.c                         |   10 +----
 include/linux/netdevice.h                     |    4 +-
 net/core/dev.c                                |   55 +++++++++++++++----------
 4 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 05d6d48..c42498d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8218,20 +8218,18 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
 					      input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 			      void *accel_priv, select_queue_fallback_t fallback)
 {
-	struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
 	struct ixgbe_adapter *adapter;
 	struct ixgbe_ring_feature *f;
-#endif
 	int txq;
 
-	if (fwd_adapter) {
+	if (accel_priv) {
 		u8 tc = netdev_get_num_tc(dev) ?
 			netdev_get_prio_tc_map(dev, skb->priority) : 0;
-		struct net_device *vdev = fwd_adapter->netdev;
+		struct net_device *vdev = accel_priv;
 
 		txq = vdev->tc_to_txq[tc].offset;
 		txq += reciprocal_scale(skb_get_hash(skb),
@@ -8240,7 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		return txq;
 	}
 
-#ifdef IXGBE_FCOE
 
 	/*
 	 * only execute the code below if protocol is FCoE
@@ -8267,11 +8264,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		txq -= f->indices;
 
 	return txq + f->offset;
-#else
-	return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 			       struct xdp_frame *xdpf)
 {
@@ -10071,7 +10066,6 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
-	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_set_rx_mode	= ixgbe_set_rx_mode,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address	= ixgbe_set_mac,
@@ -10094,6 +10088,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_poll_controller	= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
 	.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
 	.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	const struct macvlan_dev *vlan = netdev_priv(dev);
 	const struct macvlan_port *port = vlan->port;
 	const struct macvlan_dev *dest;
-	void *accel_priv = NULL;
 
 	if (vlan->mode == MACVLAN_MODE_BRIDGE) {
 		const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 			return NET_XMIT_SUCCESS;
 		}
 	}
-
-	/* For packets that are non-multicast and not bridged we will pass
-	 * the necessary information so that the lowerdev can distinguish
-	 * the source of the packets via the accel_priv value.
-	 */
-	accel_priv = vlan->accel_priv;
 xmit_world:
 	skb->dev = vlan->lowerdev;
-	return dev_queue_xmit_accel(skb, accel_priv);
+	return dev_queue_xmit_accel(skb,
+				    netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, struct sk_buff *skb)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bbc8045..df5a582 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2072,7 +2072,7 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv);
+				    struct net_device *sb_dev);
 
 /* returns the headroom that the master device needs to take in account
  * when forwarding to this dev
@@ -2523,7 +2523,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index eba255d..507e606 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2704,24 +2704,26 @@ void netif_device_attach(struct net_device *dev)
  * Returns a Tx hash based on the given packet descriptor a Tx queues' number
  * to be used as a distribution range.
  */
-static u16 skb_tx_hash(const struct net_device *dev, struct sk_buff *skb)
+static u16 skb_tx_hash(const struct net_device *dev,
+		       const struct net_device *sb_dev,
+		       struct sk_buff *skb)
 {
 	u32 hash;
 	u16 qoffset = 0;
 	u16 qcount = dev->real_num_tx_queues;
 
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+
+		qoffset = sb_dev->tc_to_txq[tc].offset;
+		qcount = sb_dev->tc_to_txq[tc].count;
+	}
+
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
 		while (unlikely(hash >= qcount))
 			hash -= qcount;
-		return hash;
-	}
-
-	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-
-		qoffset = dev->tc_to_txq[tc].offset;
-		qcount = dev->tc_to_txq[tc].count;
+		return hash + qoffset;
 	}
 
 	return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
@@ -3483,7 +3485,9 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static inline int get_xps_queue(struct net_device *dev,
+				struct net_device *sb_dev,
+				struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
@@ -3491,7 +3495,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(sb_dev->xps_maps);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3519,17 +3523,18 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 #endif
 }
 
-static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
+static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
+			     struct net_device *sb_dev)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
 
 	if (queue_index < 0 || skb->ooo_okay ||
 	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
+		int new_index = get_xps_queue(dev, sb_dev, skb);
 
 		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
+			new_index = skb_tx_hash(dev, sb_dev, skb);
 
 		if (queue_index != new_index && sk &&
 		    sk_fullsock(sk) &&
@@ -3542,9 +3547,14 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 	return queue_index;
 }
 
+static u16 __netdev_pick_tx(struct net_device *dev,
+			    struct sk_buff *skb)
+{
+	return ___netdev_pick_tx(dev, skb, NULL);
+}
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv)
+				    struct net_device *sb_dev)
 {
 	int queue_index = 0;
 
@@ -3559,10 +3569,11 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 		const struct net_device_ops *ops = dev->netdev_ops;
 
 		if (ops->ndo_select_queue)
-			queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
+			queue_index = ops->ndo_select_queue(dev, skb, sb_dev,
 							    __netdev_pick_tx);
 		else
-			queue_index = __netdev_pick_tx(dev, skb);
+			queue_index = ___netdev_pick_tx(dev, skb,
+							sb_dev ? : dev);
 
 		queue_index = netdev_cap_txqueue(dev, queue_index);
 	}
@@ -3574,7 +3585,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 /**
  *	__dev_queue_xmit - transmit a buffer
  *	@skb: buffer to transmit
- *	@accel_priv: private data used for L2 forwarding offload
+ *	@sb_dev: suboordinate device used for L2 forwarding offload
  *
  *	Queue a buffer for transmission to a network device. The caller must
  *	have set the device and priority and built the buffer before calling
@@ -3597,7 +3608,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
  *      the BH enable code must have IRQs enabled so that it will not deadlock.
  *          --BLG
  */
-static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
+static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq;
@@ -3636,7 +3647,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	else
 		skb_dst_force(skb);
 
-	txq = netdev_pick_tx(dev, skb, accel_priv);
+	txq = netdev_pick_tx(dev, skb, sb_dev);
 	q = rcu_dereference_bh(txq->qdisc);
 
 	trace_net_dev_queue(skb);
@@ -3710,9 +3721,9 @@ int dev_queue_xmit(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(dev_queue_xmit);
 
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev)
 {
-	return __dev_queue_xmit(skb, accel_priv);
+	return __dev_queue_xmit(skb, sb_dev);
 }
 EXPORT_SYMBOL(dev_queue_xmit_accel);
 

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox