Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC net-next PATCH 2/2] bpf: disallow XDP data_meta to overlap with xdp_frame area
From: Jesper Dangaard Brouer @ 2018-04-18 17:53 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev, brouer
In-Reply-To: <51af68d8-3c02-b828-12b5-f2dc73406a65@iogearbox.net>

On Wed, 18 Apr 2018 18:21:21 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 04/18/2018 02:10 PM, Jesper Dangaard Brouer wrote:
> > If combining xdp_adjust_head and xdp_adjust_meta, then it is possible
> > to make data_meta overlap with area used by xdp_frame.  And another
> > invocation of xdp_adjust_head can then clear that area, due to
> > clearing of xdp_frame area.
> > 
> > The easiest solution I found was to simply not allow
> > xdp_buff->data_meta to overlap with area used by xdp_frame.  
> 
> Thanks Jesper! Trying to answer both emails in one. :) More below.
> 
> > Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  net/core/filter.c |   11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 15e9b5477360..e3623e741181 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2701,6 +2701,11 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
> >  		     data > xdp->data_end - ETH_HLEN))
> >  		return -EINVAL;
> >  
> > +	/* Disallow data_meta to use xdp_frame area */
> > +	if (metalen > 0 &&
> > +	    unlikely((data - metalen) < xdp_frame_end))
> > +		return -EINVAL;
> > +
> >  	/* Avoid info leak, when reusing area prev used by xdp_frame */
> >  	if (data < xdp_frame_end) {  
> 
> Effectively, when metalen > 0, then data_meta < data pointer, so above test
> on new data_meta might be better, but feels like a bit of a workaround to
> handle moving data pointer but disallowing moving data_meta pointer whereas
> both could be handled if we wanted to go that path.
> 
> >  		unsigned long clearlen = xdp_frame_end - data;
> > @@ -2734,6 +2739,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> >  
> >  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
> >  {
> > +	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
> >  	void *meta = xdp->data_meta + offset;
> >  	unsigned long metalen = xdp->data - meta;
> >  
> > @@ -2742,6 +2748,11 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
> >  	if (unlikely(meta < xdp->data_hard_start ||
> >  		     meta > xdp->data))
> >  		return -EINVAL;
> > +
> > +	/* Disallow data_meta to use xdp_frame area */
> > +	if (unlikely(meta < xdp_frame_end))
> > +		return -EINVAL;  
> 
> (Ditto.)
> 
> >  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
> >  		     (metalen > 32)))
> >  		return -EACCES;  
> 
> The other, perhaps less invasive/complex option would be to just disallow
> moving anything into previous sizeof(struct xdp_frame) area. My original
> concern was that not all drivers use 256 bytes of headroom, e.g. afaik the
> i40e and ixgbe have around 192 bytes of headroom available, but that should
> actually still be plenty of space for encap + meta data, and potentially
> with meta data use I would expect that at best custom decap would be
> happening when pushing the packet up the stack. So might as well disallow
> going into that region and not worry about it. Thus, reverting e9e9545e10d3
> ("xdp: avoid leaking info stored in frame data on page reuse") and adding
> something like the below (uncompiled), should just do it:
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3bb0cb9..ad98ddd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2692,8 +2692,9 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
> 
>  BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>  {
> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>  	unsigned long metalen = xdp_get_metalen(xdp);
> -	void *data_start = xdp->data_hard_start + metalen;
> +	void *data_start = frame_end + metalen;
>  	void *data = xdp->data + offset;
> 
>  	if (unlikely(data < data_start ||
> @@ -2719,13 +2720,13 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> 
>  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>  {
> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>  	void *meta = xdp->data_meta + offset;
>  	unsigned long metalen = xdp->data - meta;
> 
>  	if (xdp_data_meta_unsupported(xdp))
>  		return -ENOTSUPP;
> -	if (unlikely(meta < xdp->data_hard_start ||
> -		     meta > xdp->data))
> +	if (unlikely(meta < frame_end || meta > xdp->data))
>  		return -EINVAL;
>  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
>  		     (metalen > 32)))

Okay, so you say just disallow using xdp_frame area in general.  It
would be simpler.  

The advantage it that we don't run into strange situations, where the
user/bpf_prog increased headroom too much, such that convert_to_xdp_frame()
fails and thus XDP_REDIRECT action fails.  (That will be confusing to
users to debug/troubleshoot).


> On top of that, we could even store a bool in struct xdp_rxq_info whether
> the driver actually is able to participate in resp. has the XDP_REDIRECT
> support and if not do something like:
> 
> void *frame_end = xdp->data_hard_start + xdp->rxq->has_redir ? sizeof(struct xdp_frame) : 0;
> 
> But the latter is merely a small optimization. Eventually we want all native XDP
> drivers to support it. Thoughts?

I would _really_ like see all drivers support XDP_REDIRECT, but to be
realistic, this is happening way too slow...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Mikulas Patocka @ 2018-04-18 17:53 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, edumazet, bhutchings, netdev, linux-kernel, mst,
	jasowang, virtualization
In-Reply-To: <20180418.134651.2225112489265654270.davem@davemloft.net>

On Wed, 18 Apr 2018, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Wed, 18 Apr 2018 12:44:25 -0400 (EDT)
> 
> > The structure net_device is followed by arbitrary driver-specific data 
> > (accessible with the function netdev_priv). And for virtio-net, these 
> > driver-specific data must be in DMA memory.
> 
> And we are saying that this assumption is wrong and needs to be
> corrected.

So, try to find all the networking drivers that to DMA to the private 
area.

The problem here is that kvzalloc usually returns DMA-able area, but it 
may return non-DMA area rarely, if the memory is too fragmented. So, we 
are in a situation, where some networking drivers will randomly fail. Go 
and find them.

Mikulas

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Michael S. Tsirkin @ 2018-04-18 17:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mikulas Patocka, David S. Miller, Eric Dumazet, Joby Poriyath,
	Ben Hutchings, netdev, linux-kernel
In-Reply-To: <3e65977e-53cd-bf09-bc4b-0ce40e9091fe@gmail.com>

On Wed, Apr 18, 2018 at 09:05:54AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> > The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> > fails (later patches change it to kvzalloc).
> > 
> > The problem with this is that if the vzalloc function is actually used, 
> > virtio_net doesn't work (because it expects that the extra memory should 
> > be accessible with DMA-API and memory allocated with vzalloc isn't).
> > 
> > This patch changes it back to kzalloc and adds a warning if the allocated
> > size is too large (the allocation is unreliable in this case).
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> > 
> > ---
> >  net/core/dev.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6/net/core/dev.c
> > ===================================================================
> > --- linux-2.6.orig/net/core/dev.c	2018-04-16 21:08:36.000000000 +0200
> > +++ linux-2.6/net/core/dev.c	2018-04-18 16:24:43.000000000 +0200
> > @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> >  	/* ensure 32-byte alignment of whole construct */
> >  	alloc_size += NETDEV_ALIGN - 1;
> >  
> > -	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > +	WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> > +	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> >  	if (!p)
> >  		return NULL;
> >  
> > 
> 
> Since when a net_device needs to be in DMA zone ???

It's likely that we are not the only device like this.
It would be better to find a way to find devices like this.

Imagine you want to pass some data to card.
Natural thing is to just put it in a variable and start DMA.
However DMA API disallows stack access nowdays,
so it's natural to put this within struct device.

See e.g.

	commit a725ee3e44e39dab1ec82cc745899a785d2a555e
	Author: Andy Lutomirski <luto@kernel.org>
	Date:   Mon Jul 18 15:34:49 2016 -0700

	    virtio-net: Remove more stack DMA



> I would rather fix virtio_net, this looks very suspect to me.


It's been done for years. I'm fine with changing virtio-net and
allocating DMA memory separately but I am not sure it's appropriate on
net.

And OTOH, shouldn't drivers avoid allocating such huge device structs?
Abusing vmalloc won't work well on 32 bit platforms.

> Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
> instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.

It's not a DMA memory at all (not a synchronous memory) and it is not
huge.  It's a small chunk of regular memory that is mapped for DMA for a
short while, then unmapped.

-- 
MST

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Mikulas Patocka @ 2018-04-18 17:55 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, edumazet, joby.poriyath, bhutchings, netdev,
	linux-kernel, mst, jasowang, virtualization
In-Reply-To: <20180418.134721.283381724211219025.davem@davemloft.net>



On Wed, 18 Apr 2018, David Miller wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 18 Apr 2018 09:51:25 -0700
> 
> > I suggest that virtio_net clearly identifies which part needs a specific allocation
> > and does its itself, instead of abusing the netdev_priv storage.
> > 
> > Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
> 
> +1

And what about the other networking drivers that may do the same thing?

Mikulas

^ permalink raw reply

* Re: [PATCH 0/8] ipconfig: NTP server support, bug fixes, documentation improvements
From: David Miller @ 2018-04-18 17:59 UTC (permalink / raw)
  To: chris; +Cc: netdev
In-Reply-To: <20180417205830.1812-1-chris@chrisn.me.uk>

From: Chris Novakovic <chris@chrisn.me.uk>
Date: Tue, 17 Apr 2018 21:58:22 +0100

>  - Patch #7 allows for NTP servers to be configured (manually on the
>    kernel command line or automatically via DHCP), enabling systems with
>    an NFS root filesystem to synchronise their clock before mounting
>    their root filesystem.

I think a plain file named /proc/net/ntp is quite confusing.  It doesn't
give any indication that it's a special file populated only by ipconfig
and not some general NTP thing the kernel is doing.

I would suggest creating a subdirectory like /proc/net/ipconfig or similar
to put such files.  Then the use and meaning is clear.

Also, as noted please remove patch 8 updating CREDITS, this really isn't
done any more.

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Michael S. Tsirkin @ 2018-04-18 18:00 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, netdev, linux-kernel, virtualization, edumazet,
	mpatocka, joby.poriyath, bhutchings
In-Reply-To: <20180418.134721.283381724211219025.davem@davemloft.net>

On Wed, Apr 18, 2018 at 01:47:21PM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 18 Apr 2018 09:51:25 -0700
> 
> > I suggest that virtio_net clearly identifies which part needs a specific allocation
> > and does its itself, instead of abusing the netdev_priv storage.
> > 
> > Ie use a pointer to a block of memory, allocated by virtio_net, for virtio_net.
> 
> +1

I can do this, but just FYI it's all of 16 bytes which is only mapped
for DMA temporarily - and not all of it - a byte here, a byte there.
The amount of hoops one has to jump through just to get 1 byte from
device nowdays is annoying.

-- 
MST

^ permalink raw reply

* Re: [PATCH bpf-next v4 03/10] bpf: btf: Check members of struct/union
From: Martin KaFai Lau @ 2018-04-18 18:01 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180418102210.5fbbb05e@cakuba.netronome.com>

On Wed, Apr 18, 2018 at 10:22:10AM -0700, Jakub Kicinski wrote:
> On Tue, 17 Apr 2018 13:42:36 -0700, Martin KaFai Lau wrote:
> > This patch checks a few things of struct's members:
> > 
> > 1) It has a valid size (e.g. a "const void" is invalid)
> > 2) A member's size (+ its member's offset) does not exceed
> >    the containing struct's size.
> > 3) The member's offset satisfies the alignment requirement
> 
> Could we also introduce a requirement for members to have different
> names?  Maybe it's there but I missed it.  Would BTF with duplicated
> member names be considered valid?
It could check but I don't see BTF needs to check everything
that clang does.

> 
> > The above can only be done after the needs_resolve member's type
> > is resolved.  Hence, the above is done together in
> > btf_struct_resolve().
> > 
> > Each possible member's type (e.g. int, enum, modifier...) implements
> > the check_member() ops which will be called from btf_struct_resolve().
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > Acked-by: Alexei Starovoitov <ast@fb.com>

^ permalink raw reply

* [PATCH iproute2] iplink_geneve: correct size of message to avoid spurious errors
From: Jakub Kicinski @ 2018-04-18 18:06 UTC (permalink / raw)
  To: stephen; +Cc: dsahern, oss-drivers, netdev, Jakub Kicinski

Commit 6c4b672738ac ("iplink_geneve: Get rid of inet_get_addr()")
inadvertently changed the parameter to addattr_l() resulting in:

addattr_l ERROR: message exceeded bound of 4

when remote is specified.

Fixes: 6c4b672738ac ("iplink_geneve: Get rid of inet_get_addr()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 ip/iplink_geneve.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/iplink_geneve.c b/ip/iplink_geneve.c
index 1fcdd83acca9..26e70ff4db42 100644
--- a/ip/iplink_geneve.c
+++ b/ip/iplink_geneve.c
@@ -199,7 +199,7 @@ static int geneve_parse_opt(struct link_util *lu, int argc, char **argv,
 	if (is_addrtype_inet(&daddr)) {
 		int type = (daddr.family == AF_INET) ? IFLA_GENEVE_REMOTE :
 						       IFLA_GENEVE_REMOTE6;
-		addattr_l(n, sizeof(1024), type, daddr.data, daddr.bytelen);
+		addattr_l(n, 1024, type, daddr.data, daddr.bytelen);
 	}
 	if (!set_op || GENEVE_ATTRSET(attrs, IFLA_GENEVE_LABEL))
 		addattr32(n, 1024, IFLA_GENEVE_LABEL, label);
-- 
2.16.2

^ permalink raw reply related

* Re: [PATCH 0/8] ipconfig: NTP server support, bug fixes, documentation improvements
From: Chris Novakovic @ 2018-04-18 18:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20180418.135909.1521396191656904961.davem@davemloft.net>

On 18/04/2018 18:59, David Miller wrote:
> I think a plain file named /proc/net/ntp is quite confusing.  It doesn't
> give any indication that it's a special file populated only by ipconfig
> and not some general NTP thing the kernel is doing.
> 
> I would suggest creating a subdirectory like /proc/net/ipconfig or similar
> to put such files.  Then the use and meaning is clear.

That sounds more sensible --- I'll rework patch 7 and resend. (Would you
prefer me to post the entire series to the list again, or just that one
patch?)

> Also, as noted please remove patch 8 updating CREDITS, this really isn't
> done any more.

Will do.

^ permalink raw reply

* Re: [PATCH bpf] tools/bpf: fix test_sock and test_sock_addr.sh failure
From: Andrey Ignatov @ 2018-04-18 18:10 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180418174912.2503994-1-yhs@fb.com>

The patch looks good to me.

Acked-by: Andrey Ignatov <rdna@fb.com>

Thanks for improving the tests, Yonghong!

Yonghong Song <yhs@fb.com> [Wed, 2018-04-18 10:49 -0700]:
> The bpf selftests test_sock and test_sock_addr.sh failed
> in my test machine. The failure looks like:
>     $ ./test_sock
>     Test case: bind4 load with invalid access: src_ip6 .. [PASS]
>     Test case: bind4 load with invalid access: mark .. [PASS]
>     Test case: bind6 load with invalid access: src_ip4 .. [PASS]
>     Test case: sock_create load with invalid access: src_port .. [PASS]
>     Test case: sock_create load w/o expected_attach_type (compat mode) .. [FAIL]
>     Test case: sock_create load w/ expected_attach_type .. [FAIL]
>     Test case: attach type mismatch bind4 vs bind6 .. [FAIL]
>     ...
>     Summary: 4 PASSED, 12 FAILED
>     $ ./test_sock_addr.sh
>     Wait for testing IPv4/IPv6 to become available .....
>     ERROR: Timeout waiting for test IP to become available.
> 
> In test_sock, bpf program loads failed due to hitting memlock limits.
> In test_sock_addr.sh, my test machine is a ipv6 only test box and using
> "ping" without specifying address family for an ipv6 address does not work.
> 
> This patch fixed the issue by including header bpf_rlimit.h in test_sock.c
> and test_sock_addr.c, and specifying address family for ping command.
> 
> Cc: Andrey Ignatov <rdna@fb.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/testing/selftests/bpf/test_sock.c       | 1 +
>  tools/testing/selftests/bpf/test_sock_addr.c  | 1 +
>  tools/testing/selftests/bpf/test_sock_addr.sh | 4 ++--
>  3 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/test_sock.c b/tools/testing/selftests/bpf/test_sock.c
> index 73bb20c..f4d99fa 100644
> --- a/tools/testing/selftests/bpf/test_sock.c
> +++ b/tools/testing/selftests/bpf/test_sock.c
> @@ -13,6 +13,7 @@
>  #include <bpf/bpf.h>
>  
>  #include "cgroup_helpers.h"
> +#include "bpf_rlimit.h"
>  
>  #ifndef ARRAY_SIZE
>  # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
> index d488f20..2950f80 100644
> --- a/tools/testing/selftests/bpf/test_sock_addr.c
> +++ b/tools/testing/selftests/bpf/test_sock_addr.c
> @@ -15,6 +15,7 @@
>  #include <bpf/libbpf.h>
>  
>  #include "cgroup_helpers.h"
> +#include "bpf_rlimit.h"
>  
>  #define CG_PATH	"/foo"
>  #define CONNECT4_PROG_PATH	"./connect4_prog.o"
> diff --git a/tools/testing/selftests/bpf/test_sock_addr.sh b/tools/testing/selftests/bpf/test_sock_addr.sh
> index c6e1dcf..9832a87 100755
> --- a/tools/testing/selftests/bpf/test_sock_addr.sh
> +++ b/tools/testing/selftests/bpf/test_sock_addr.sh
> @@ -4,7 +4,7 @@ set -eu
>  
>  ping_once()
>  {
> -	ping -q -c 1 -W 1 ${1%%/*} >/dev/null 2>&1
> +	ping -${1} -q -c 1 -W 1 ${2%%/*} >/dev/null 2>&1
>  }
>  
>  wait_for_ip()
> @@ -13,7 +13,7 @@ wait_for_ip()
>  	echo -n "Wait for testing IPv4/IPv6 to become available "
>  	for _i in $(seq ${MAX_PING_TRIES}); do
>  		echo -n "."
> -		if ping_once ${TEST_IPv4} && ping_once ${TEST_IPv6}; then
> +		if ping_once 4 ${TEST_IPv4} && ping_once 6 ${TEST_IPv6}; then
>  			echo " OK"
>  			return
>  		fi
> -- 
> 2.9.5
> 

-- 
Andrey Ignatov

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Alexander Duyck @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller
  Cc: Sowmini Varadhan, Willem de Bruijn, Samudrala, Sridhar, Netdev,
	Willem de Bruijn
In-Reply-To: <20180418.132808.1710130437020293308.davem@davemloft.net>

On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Wed, 18 Apr 2018 08:31:03 -0400
>
>> However, I share Sridhar's concerns about the very fundamental change
>> to UDP message boundary semantics here.  There is actually no such thing
>> as a "segment" in udp, so in general this feature makes me a little
>> uneasy.  Well behaved udp applications should already be sending mtu
>> sized datagrams. And the not-so-well-behaved ones are probably relying
>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>> for them?
>>
>> As Sridhar points out, the feature is not really "negotiated" - one side
>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>> implementation, it will have no way of knowing that message boundaries
>> have been re-adjusted at the sender.
>
> There are no "semantics".
>
> What ends up on the wire is the same before the kernel/app changes as
> afterwards.
>
> The only difference is that instead of the application doing N - 1
> sendmsg() calls with mtu sized writes, it's giving everything all at
> once and asking the kernel to segment.
>
> It even gives the application control over the size of the packets,
> which I think is completely prudent in this situation.

My only concern with the patch set is verifying what mitigations are
in case so that we aren't trying to set an MSS size that results in a
frame larger than MTU. I'm still digging through the code and trying
to grok it, but I figured I might just put the question out there to
may my reviewing easier.

Also any plans for HW offload support for this? I vaguely recall that
the igb and ixgbe parts had support for something like this in
hardware. I would have to double check to see what exactly is
supported.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH 0/8] ipconfig: NTP server support, bug fixes, documentation improvements
From: Chris Novakovic @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <e01df1b3-2780-dffb-9c10-b5be8792f126@chrisn.me.uk>

On 18/04/2018 19:06, Chris Novakovic wrote:
> On 18/04/2018 18:59, David Miller wrote:
>> I think a plain file named /proc/net/ntp is quite confusing.  It doesn't
>> give any indication that it's a special file populated only by ipconfig
>> and not some general NTP thing the kernel is doing.
>>
>> I would suggest creating a subdirectory like /proc/net/ipconfig or similar
>> to put such files.  Then the use and meaning is clear.
> 
> That sounds more sensible --- I'll rework patch 7 and resend. (Would you
> prefer me to post the entire series to the list again, or just that one
> patch?)

Sorry, I just realised how stupid that sounded: I can't remove patch 8
unless I repost the series... :)

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Willem de Bruijn @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Willem de Bruijn
In-Reply-To: <20180418.135010.273233202471255234.davem@davemloft.net>

On Wed, Apr 18, 2018 at 1:50 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Tue, 17 Apr 2018 16:00:50 -0400
>
>> Segmentation offload reduces cycles/byte for large packets by
>> amortizing the cost of protocol stack traversal.
>>
>> This patchset implements GSO for UDP.
>
> This looks great.
>
> And as mentioned in other emails, the interface looks good too by letting
> the app choose the segmentation point.
>
> It's a real shame we lose checksumming with MSG_MORE, perhaps we can find
> a way to lift that restriction?

Actually, yes, I should be able to relax that constraint with
segmentation.

It is there in case a corked packet may grow to the point of
having to be fragmented. But segmentation already ensures
that its datagrams always fit in MTU.

Should be simple refinement to

            !(flags & MSG_MORE) &&

in __ip_append_data.

^ permalink raw reply

* [PATCH net-next] team: account for oper state
From: George Wilkie @ 2018-04-18 10:29 UTC (permalink / raw)
  To: Jiri Pirko, netdev

Account for operational state when determining port linkup state,
as per Documentation/networking/operstates.txt.

Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
---
 drivers/net/team/team.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index a6c6ce19eeee..231264a05e55 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
 	case NETDEV_CHANGE:
 		if (netif_running(port->dev))
 			team_port_change_check(port,
-					       !!netif_carrier_ok(port->dev));
+					       !!(netif_carrier_ok(port->dev) &&
+						  netif_oper_up(port->dev)));
 		break;
 	case NETDEV_UNREGISTER:
 		team_del_slave(port->team->dev, dev);
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net 5/5] nfp: remove false positive offloads in flower vxlan
From: Or Gerlitz @ 2018-04-18 18:18 UTC (permalink / raw)
  To: John Hurley; +Cc: Jakub Kicinski, Linux Netdev List, oss-drivers, Simon Horman
In-Reply-To: <CAK+XE=ks_z4xxx9gmV-HjpOH0KBz+qPZ4c+kB990-W9nBazj+A@mail.gmail.com>

On Wed, Apr 18, 2018 at 3:31 PM, John Hurley <john.hurley@netronome.com> wrote:
> On Wed, Apr 18, 2018 at 8:43 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Fri, Nov 17, 2017 at 4:06 AM, Jakub Kicinski
>> <jakub.kicinski@netronome.com> wrote:
>>> From: John Hurley <john.hurley@netronome.com>
>>>
>>> Pass information to the match offload on whether or not the repr is the
>>> ingress or egress dev. Only accept tunnel matches if repr is the egress dev.
>>>
>>> This means rules such as the following are successfully offloaded:
>>> tc .. add dev vxlan0 .. enc_dst_port 4789 .. action redirect dev nfp_p0
>>>
>>> While rules such as the following are rejected:
>>> tc .. add dev nfp_p0 .. enc_dst_port 4789 .. action redirect dev vxlan0
>>
>> cool
>>
>>
>>> Also reject non tunnel flows that are offloaded to an egress dev.
>>> Non tunnel matches assume that the offload dev is the ingress port and
>>> offload a match accordingly.
>>
>> not following on the "Also" here, see below
>>
>>
>>> diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>> index a0193e0c24a0..f5d73b83dcc2 100644
>>> --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>> @@ -131,7 +131,8 @@ static bool nfp_flower_check_higher_than_mac(struct tc_cls_flower_offload *f)
>>>
>>>  static int
>>>  nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>> -                               struct tc_cls_flower_offload *flow)
>>> +                               struct tc_cls_flower_offload *flow,
>>> +                               bool egress)
>>>  {
>>>         struct flow_dissector_key_basic *mask_basic = NULL;
>>>         struct flow_dissector_key_basic *key_basic = NULL;
>>> @@ -167,6 +168,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>                         skb_flow_dissector_target(flow->dissector,
>>>                                                   FLOW_DISSECTOR_KEY_ENC_CONTROL,
>>>                                                   flow->key);
>>> +               if (!egress)
>>> +                       return -EOPNOTSUPP;
>>> +
>>>                 if (mask_enc_ctl->addr_type != 0xffff ||
>>>                     enc_ctl->addr_type != FLOW_DISSECTOR_KEY_IPV4_ADDRS)
>>>                         return -EOPNOTSUPP;
>>> @@ -194,6 +198,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>
>>>                 key_layer |= NFP_FLOWER_LAYER_VXLAN;
>>>                 key_size += sizeof(struct nfp_flower_vxlan);
>>> +       } else if (egress) {
>>> +               /* Reject non tunnel matches offloaded to egress repr. */
>>> +               return -EOPNOTSUPP;
>>>         }
>>
>> with these two hunks we get: egress <- IFF -> encap match, right?
>>
>> (1) we can't offload the egress way if there isn't matching on encap headers
>> (2) we can't go the matching on encap headers way if we are not egress
>>
>
> yes, this is correct.
> With the block code and egdev offload, we do not have access to the
> ingress netdev when doing an offload.
> We need to use the encap headers (especially the enc_port) to
> distinguish the type of tunnel used and, therefore, require that the
> encap matches be present before offloading.
>
>> what other cases are rejected by this logic?
>>
>
> Yes, some other cases may be rejected (like veth mentioned below).

my claim is that the veth case I mentioned below will not be rejected
if it has the matching on encap headers, and a wrong rule will be set
into hw, agree?

> However, this is better than allowing rules to be incorrectly
> offloaded (as could have happened before these changes).

> Currently, we are looking at offloading flows on other ingress devices
> such as bonds so this will require a change to the driver code here.

for the ingress side, Jiri suggested that the slave devices (uplink reps),
will be just getting all the rules set on the bond, so I am not sure what
problem you see here... for decap it will be still vxlan --> vf rep and your
egress logic will allow it.

> IMO, the cleanest solution will also require tc core changes to either
> avoid egdev offload or to have access to the ingress netdev of a rule.

>> e.g If we add a rule with SW device (veth. tap) being the ingress, and
>> HW device (vf rep)
>> being the egress while not using skip_sw (just no flags == both) we
>> get the TC stack
>> go along the egdev callback from the vf rep hw device and add an
>> (uplink --> vf rep) rule
>> which will not be rejected if there is matching on tunnel headers, it
>> will also not be rejected
>> by some driver logic as the one we discussed to identify and ignore
>> rules that are attempted to being added twice.
>>
>> Or.

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Willem de Bruijn @ 2018-04-18 18:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn
In-Reply-To: <CAKgT0UcefCpG2j7RQN8aUUgu2bud_RTFs_s+nFPY2GhKgE8L+A@mail.gmail.com>

On Wed, Apr 18, 2018 at 2:12 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>
>>> However, I share Sridhar's concerns about the very fundamental change
>>> to UDP message boundary semantics here.  There is actually no such thing
>>> as a "segment" in udp, so in general this feature makes me a little
>>> uneasy.  Well behaved udp applications should already be sending mtu
>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>>> for them?
>>>
>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>> implementation, it will have no way of knowing that message boundaries
>>> have been re-adjusted at the sender.
>>
>> There are no "semantics".
>>
>> What ends up on the wire is the same before the kernel/app changes as
>> afterwards.
>>
>> The only difference is that instead of the application doing N - 1
>> sendmsg() calls with mtu sized writes, it's giving everything all at
>> once and asking the kernel to segment.
>>
>> It even gives the application control over the size of the packets,
>> which I think is completely prudent in this situation.
>
> My only concern with the patch set is verifying what mitigations are
> in case so that we aren't trying to set an MSS size that results in a
> frame larger than MTU. I'm still digging through the code and trying
> to grok it, but I figured I might just put the question out there to
> may my reviewing easier.

This is checked in udp_send_skb in

                const int hlen = skb_network_header_len(skb) +
                                 sizeof(struct udphdr);

                if (hlen + cork->gso_size > cork->fragsize)
                        return -EINVAL;

At this point cork->fragsize will have been set in ip_setup_cork
based on device or path mtu:

        cork->fragsize = ip_sk_use_pmtu(sk) ?
                         dst_mtu(&rt->dst) : rt->dst.dev->mtu;

The feature bypasses the MTU sanity checks in __ip_append_data
by setting local variable mtu to a network layer max

        mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;

but the above should perform the same check, only later. The
check in udp_send_skb is simpler as it does not have to deal
with the case of fragmentation.

> Also any plans for HW offload support for this? I vaguely recall that
> the igb and ixgbe parts had support for something like this in
> hardware. I would have to double check to see what exactly is
> supported.

I hadn't given that much thought until the request yesterday to
expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
virtue of having only a single fixed segmentation length, it
appears reasonably straightforward to offload.

^ permalink raw reply

* Re: [PATCH bpf-next v4 03/10] bpf: btf: Check members of struct/union
From: Jakub Kicinski @ 2018-04-18 18:30 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180418180101.6tdt5hr3ned25gu3@kafai-mbp>

On Wed, 18 Apr 2018 11:01:15 -0700, Martin KaFai Lau wrote:
> On Wed, Apr 18, 2018 at 10:22:10AM -0700, Jakub Kicinski wrote:
> > On Tue, 17 Apr 2018 13:42:36 -0700, Martin KaFai Lau wrote:  
> > > This patch checks a few things of struct's members:
> > > 
> > > 1) It has a valid size (e.g. a "const void" is invalid)
> > > 2) A member's size (+ its member's offset) does not exceed
> > >    the containing struct's size.
> > > 3) The member's offset satisfies the alignment requirement  
> > 
> > Could we also introduce a requirement for members to have different
> > names?  Maybe it's there but I missed it.  Would BTF with duplicated
> > member names be considered valid?  
>
> It could check but I don't see BTF needs to check everything
> that clang does.

Agreed, I don't think correct tooling should ever generate duplicated
members.  Should the BTF forbid it then?  It could help catch bugs and
avoid problems.  I was thinking about JSON where duplicated field
names will result in invalid JSON potentially leading to issues in
user space stacks...

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-18 18:43 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <20180418092515.GB1989@nanopsycho>

On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>>>> This provides a generic interface for paravirtual drivers to listen
>>>> for netdev register/unregister/link change events from pci ethernet
>>>> devices with the same MAC and takeover their datapath. The notifier and
>>>> event handling code is based on the existing netvsc implementation.
>>>>
>>>> It exposes 2 sets of interfaces to the paravirtual drivers.
>>>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>>>> master netdev is created. The paravirtual driver registers each bypass
>>>> instance along with a set of ops to manage the slave events.
>>>>       bypass_master_register()
>>>>       bypass_master_unregister()
>>>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>>>> the bypass module provides interfaces to create/destroy additional master
>>>> netdev and all the slave events are managed internally.
>>>>        bypass_master_create()
>>>>        bypass_master_destroy()
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> ---
>>>> include/linux/netdevice.h |  14 +
>>>> include/net/bypass.h      |  96 ++++++
>>>> net/Kconfig               |  18 +
>>>> net/core/Makefile         |   1 +
>>>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 5 files changed, 973 insertions(+)
>>>> create mode 100644 include/net/bypass.h
>>>> create mode 100644 net/core/bypass.c
>>>>
>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>> index cf44503ea81a..587293728f70 100644
>>>> --- a/include/linux/netdevice.h
>>>> +++ b/include/linux/netdevice.h
>>>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>>>> 	IFF_PHONY_HEADROOM		= 1<<24,
>>>> 	IFF_MACSEC			= 1<<25,
>>>> 	IFF_NO_RX_HANDLER		= 1<<26,
>>>> +	IFF_BYPASS			= 1 << 27,
>>>> +	IFF_BYPASS_SLAVE		= 1 << 28,
>>> I wonder, why you don't follow the existing coding style... Also, please
>>> add these to into the comment above.
>> To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> to the existing coding style to be consistent.
> Please do.
>
>
>>>
>>>> };
>>>>
>>>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>>>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>>>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>>>> #define IFF_MACSEC			IFF_MACSEC
>>>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>>>> +#define IFF_BYPASS			IFF_BYPASS
>>>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>>>
>>>> /**
>>>>    *	struct net_device - The DEVICE structure.
>>>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>>>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>>>> }
>>>>
>>>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS;
>>>> +}
>>>> +
>>>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>>>> +}
>>>> +
>>>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>>>> static inline void netif_keep_dst(struct net_device *dev)
>>>> {
>>>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>>>> new file mode 100644
>>>> index 000000000000..86b02cb894cf
>>>> --- /dev/null
>>>> +++ b/include/net/bypass.h
>>>> @@ -0,0 +1,96 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2018, Intel Corporation. */
>>>> +
>>>> +#ifndef _NET_BYPASS_H
>>>> +#define _NET_BYPASS_H
>>>> +
>>>> +#include <linux/netdevice.h>
>>>> +
>>>> +struct bypass_ops {
>>>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>>>> +				  struct net_device *bypass_netdev);
>>>> +	int (*slave_join)(struct net_device *slave_netdev,
>>>> +			  struct net_device *bypass_netdev);
>>>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>>>> +				    struct net_device *bypass_netdev);
>>>> +	int (*slave_release)(struct net_device *slave_netdev,
>>>> +			     struct net_device *bypass_netdev);
>>>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>>>> +				 struct net_device *bypass_netdev);
>>>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>>>> +};
>>>> +
>>>> +struct bypass_master {
>>>> +	struct list_head list;
>>>> +	struct net_device __rcu *bypass_netdev;
>>>> +	struct bypass_ops __rcu *ops;
>>>> +};
>>>> +
>>>> +/* bypass state */
>>>> +struct bypass_info {
>>>> +	/* passthru netdev with same MAC */
>>>> +	struct net_device __rcu *active_netdev;
>>> You still use "active"/"backup" names which is highly misleading as
>>> it has completely different meaning that in bond for example.
>>> I noted that in my previous review already. Please change it.
>> I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> matches with the BACKUP feature bit we are adding to virtio_net.
> I think that "backup" is also misleading. Both "active" and "backup"
> mean a *state* of slaves. This should be named differently.
>
>
>
>> With regards to alternate names for 'active', you suggested 'stolen', but i
>> am not too happy with it.
>> netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> No. The netdev could be any netdevice. It does not have to be a "VF".
> I think "stolen" is quite appropriate since it describes the modus
> operandi. The bypass master steals some netdevice according to some
> match.
>
> But I don't insist on "stolen". Just sounds right.

We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
'backup' name is consistent.

The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.

Will look for any suggestions in the next day or two. If i don't get any, i will go
with 'stolen'

<snip>


> +
> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> +						  struct bypass_ops **ops)
> +{
> +	struct bypass_master *bypass_master;
> +	struct net_device *bypass_netdev;
> +
> +	spin_lock(&bypass_lock);
> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>> As I wrote the last time, you don't need this list, spinlock.
>>> You can do just something like:
>>>           for_each_net(net) {
>>>                   for_each_netdev(net, dev) {
>>> 			if (netif_is_bypass_master(dev)) {
>> This function returns the upper netdev as well as the ops associated
>> with that netdev.
>> bypass_master_list is a list of 'struct bypass_master' that associates
> Well, can't you have it in netdev priv?

We cannot do this for 2-netdev model as there is no bypass_netdev created.

>
>
>> 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> NULL for 3-netdev model.
> I see :(
>
>
>>
>>>
>>>
>>>
>>>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>>>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>>>> +			*ops = rcu_dereference(bypass_master->ops);
>>> I don't see how rcu_dereference is ok here.
>>> 1) I don't see rcu_read_lock taken
>>> 2) Looks like bypass_master->ops has the same value across the whole
>>>      existence.
>> We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> Yes. ops doesn't change.
> If it does not change, you can just access it directly.
>
>
>>>
>>>> +			spin_unlock(&bypass_lock);
>>>> +			return bypass_netdev;
>>>> +		}
>>>> +	}
>>>> +	spin_unlock(&bypass_lock);
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static int bypass_slave_register(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret, orig_mtu;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>> For master, could you use word "master" in the variables so it is clear?
>>> Also, "dev" is fine instead of "netdev".
>>> Something like "bpmaster_dev"
>> bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
> I was trying to point out, that "bypass_netdev" represents a "master"
> netdev, yet it does not say master. That is why I suggested
> "bpmaster_dev"
>
>
>> I can change all _netdev suffixes to _dev to make the names shorter.
> ok.
>
>
>>
>>>
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>>>> +					bypass_ops);
>>>> +	if (ret != 0)
>>> 	Just "if (ret)" will do. You have this on more places.
>> OK.
>>
>>
>>>
>>>> +		goto done;
>>>> +
>>>> +	ret = netdev_rx_handler_register(slave_netdev,
>>>> +					 bypass_ops ? bypass_ops->handle_frame :
>>>> +					 bypass_handle_frame, bypass_netdev);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>>>> +			   ret);
>>>> +		goto done;
>>>> +	}
>>>> +
>>>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>>>> +			   bypass_netdev->name, ret);
>>>> +		goto upper_link_failed;
>>>> +	}
>>>> +
>>>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>>>> +
>>>> +	if (netif_running(bypass_netdev)) {
>>>> +		ret = dev_open(slave_netdev);
>>>> +		if (ret && (ret != -EBUSY)) {
>>>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>>>> +				   slave_netdev->name, ret);
>>>> +			goto err_interface_up;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* Align MTU of slave with master */
>>>> +	orig_mtu = slave_netdev->mtu;
>>>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>>>> +			   slave_netdev->name, bypass_netdev->mtu);
>>>> +		goto err_set_mtu;
>>>> +	}
>>>> +
>>>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto err_join;
>>>> +
>>>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	goto done;
>>>> +
>>>> +err_join:
>>>> +	dev_set_mtu(slave_netdev, orig_mtu);
>>>> +err_set_mtu:
>>>> +	dev_close(slave_netdev);
>>>> +err_interface_up:
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +upper_link_failed:
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>>>> +				       struct net_device *bypass_netdev,
>>>> +				       struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_pre_unregister)
>>>> +			return -EINVAL;
>>>> +
>>>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>>>> +							bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int bypass_slave_release(struct net_device *slave_netdev,
>>>> +				struct net_device *bypass_netdev,
>>>> +				struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_release)
>>>> +			return -EINVAL;
>>> I think it would be good to make the API to the driver more strict and
>>> have a separate set of ops for "active" and "backup" netdevices.
>>> That should stop people thinking about extending this to more slaves in
>>> the future.
>> We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> 'active' slave.
> I'm very well aware of that. I just thought that explicit ops for the
> two slaves would make this more clear.
>
>
>>
>>>
>>>
>>>> +
>>>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev == backup_netdev) {
>>>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>>>> +	} else {
>>>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>>>> +		if (backup_netdev) {
>>>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>>>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	dev_put(slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>>>> +					  bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto done;
>>>> +
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +
>>>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>>>> +
>>>> +static bool bypass_xmit_ready(struct net_device *dev)
>>>> +{
>>>> +	return netif_running(dev) && netif_carrier_ok(dev);
>>>> +}
>>>> +
>>>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_link_change)
>>>> +			goto done;
>>>> +
>>>> +		return bypass_ops->slave_link_change(slave_netdev,
>>>> +						     bypass_netdev);
>>>> +	}
>>>> +
>>>> +	if (!netif_running(bypass_netdev))
>>>> +		return 0;
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		goto done;
>>> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>>> above is enough.
>> I think we need this check to not allow events from a slave that is not
>> attached to this master but has the same MAC.
> Why do we need such events? Seems wrong to me.

We want to avoid events from a netdev that is mis-configured with the same MAC as
a bypass setup.

>   Consider:
>
> bp1      bp2
> a1 b1    a2 b2
>
>
> a1 and a2 have the same mac and bp1 and bp2 have the same mac.

We should not have 2 bypass configs with the same MAC.
I need to add a check in the bypass_master_register() to prevent this.

The above check is to avoid cases where we have
bp1(a1, b1) with mac1
and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.

> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> the order of creation.
> Let's say it will return bp1. Then when we have event for a2, the
> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>
>
> You cannot use bypass_master_get_bymac() here.
>
>
>
>>>
>>>> +
>>>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>>>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>>>> +		netif_carrier_on(bypass_netdev);
>>>> +		netif_tx_wake_all_queues(bypass_netdev);
>>>> +	} else {
>>>> +		netif_carrier_off(bypass_netdev);
>>>> +		netif_tx_stop_all_queues(bypass_netdev);
>>>> +	}
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static bool bypass_validate_event_dev(struct net_device *dev)
>>>> +{
>>>> +	/* Skip parent events */
>>>> +	if (netif_is_bypass_master(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid non-Ethernet type devices */
>>>> +	if (dev->type != ARPHRD_ETHER)
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Vlan dev with same MAC registering as VF */
>>>> +	if (is_vlan_dev(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>>>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>> helpers netif_is_bond_master().
>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>
>>> You need to do it not by blacklisting, but with whitelisting. You need
>>> to whitelist VF devices. My port flavours patchset might help with this.
>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
> I don't see such function in the code.

It is netdev_has_any_lower_dev(). I need to export it.

>
>
>> device is not an upper dev.
>> Can you point to your port flavours patchset? Is it upstream?
> I sent rfc couple of weeks ago:
> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation

^ permalink raw reply

* [PATCH net-next] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
From: Eric Dumazet @ 2018-04-18 18:43 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  5 ++---
 net/core/skbuff.c      | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9065477ed255a48f7e01b8a28ea6321cce9127f5..d274059529eb5216d041dfdcad4a564a623c8ea0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3131,6 +3131,7 @@ static inline void *skb_push_rcsum(struct sk_buff *skb, unsigned int len)
 	return skb->data;
 }
 
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len);
 /**
  *	pskb_trim_rcsum - trim received skb and update checksum
  *	@skb: buffer to trim
@@ -3144,9 +3145,7 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 {
 	if (likely(len >= skb->len))
 		return 0;
-	if (skb->ip_summed == CHECKSUM_COMPLETE)
-		skb->ip_summed = CHECKSUM_NONE;
-	return __pskb_trim(skb, len);
+	return pskb_trim_rcsum_slow(skb, len);
 }
 
 static inline int __skb_trim_rcsum(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b51837ca80bb709bfffe04d58eedbba0b9907..ff49e352deea1d07049f5c5ac425fbcdb4fd96a8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1839,6 +1839,20 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL(___pskb_trim);
 
+/* Note : use pskb_trim_rcsum() instead of calling this directly
+ */
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
+{
+	if (skb->ip_summed == CHECKSUM_COMPLETE) {
+		int delta = skb->len - len;
+
+		skb->csum = csum_sub(skb->csum,
+				     skb_checksum(skb, len, delta, 0));
+	}
+	return __pskb_trim(skb, len);
+}
+EXPORT_SYMBOL(pskb_trim_rcsum_slow);
+
 /**
  *	__pskb_pull_tail - advance tail of skb header
  *	@skb: buffer to reallocate
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Stefan Berger @ 2018-04-18 18:45 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, mszeredi, luto, jlayton, carlos,
	viro, dhowells, simo, trondmy, eparis, serge, ebiederm, madzcar
In-Reply-To: <20180316035837.ddnqvbyrbp3fdk7e@madcap2.tricolour.ca>

On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
> On 2018-03-15 16:27, Stefan Berger wrote:
>> On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
>>> Implement the proc fs write to set the audit container ID of a process,
>>> emitting an AUDIT_CONTAINER record to document the event.
>>>
>>> This is a write from the container orchestrator task to a proc entry of
>>> the form /proc/PID/containerid where PID is the process ID of the newly
>>> created task that is to become the first task in a container, or an
>>> additional task added to a container.
>>>
>>> The write expects up to a u64 value (unset: 18446744073709551615).
>>>
>>> This will produce a record such as this:
>>> type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>>>
>>> The "op" field indicates an initial set.  The "pid" to "ses" fields are
>>> the orchestrator while the "opid" field is the object's PID, the process
>>> being "contained".  Old and new container ID values are given in the
>>> "contid" fields, while res indicates its success.
>>>
>>> It is not permitted to self-set, unset or re-set the container ID.  A
>>> child inherits its parent's container ID, but then can be set only once
>>> after.
>>>
>>> See: https://github.com/linux-audit/audit-kernel/issues/32
>>>
>>>
>>>    /* audit_rule_data supports filter rules with both integer and string
>>>     * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>>> index 4e0a4ac..0ee1e59 100644
>>> --- a/kernel/auditsc.c
>>> +++ b/kernel/auditsc.c
>>> @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
>>>    	return rc;
>>>    }
>>>
>>> +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>>> +{
>>> +	struct task_struct *parent;
>>> +	u64 pcontainerid, ccontainerid;
>>> +	pid_t ppid;
>>> +
>>> +	/* Don't allow to set our own containerid */
>>> +	if (current == task)
>>> +		return -EPERM;
>>> +	/* Don't allow the containerid to be unset */
>>> +	if (!cid_valid(containerid))
>>> +		return -EINVAL;
>>> +	/* if we don't have caps, reject */
>>> +	if (!capable(CAP_AUDIT_CONTROL))
>>> +		return -EPERM;
>>> +	/* if containerid is unset, allow */
>>> +	if (!audit_containerid_set(task))
>>> +		return 0;
>> I am wondering whether there should be a check for the target process that
>> will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
>> allow it to arbitrarily unshare()/clone() and leave the set of namespaces
>> that may make up the container whose containerid we assign here?
> This is a reasonable question.  This has been debated and I understood
> the conclusion was that without a clear definition of a "container", the
> task still remains in that container that just now has more
> sub-namespaces (in the case of hierarchical namespaces), we don't want
> to restrict it in such a way and that allows it to create nested
> containers.  I see setns being more problematic if it could switch to
> another existing namespace that was set up by the orchestrator for a
> different container.  The coming v2 patchset acknowledges this situation
> with the network namespace being potentially shared by multiple
> containers.

Are you going to post v2 soon? We would like to build on top of it for 
IMA namespacing and auditing inside of IMA namespaces.

    Stefan

^ permalink raw reply

* [Patch net] llc: hold llc_sap before release_sock()
From: Cong Wang @ 2018-04-18 18:51 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang

syzbot reported we still access llc->sap in llc_backlog_rcv()
after it is freed in llc_sap_remove_socket():

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 llc_conn_ac_send_sabme_cmd_p_set_x+0x3a8/0x460 net/llc/llc_c_ac.c:785
 llc_exec_conn_trans_actions net/llc/llc_conn.c:475 [inline]
 llc_conn_service net/llc/llc_conn.c:400 [inline]
 llc_conn_state_process+0x4e1/0x13a0 net/llc/llc_conn.c:75
 llc_backlog_rcv+0x195/0x1e0 net/llc/llc_conn.c:891
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x12f/0x3a0 net/core/sock.c:2335
 release_sock+0xa4/0x2b0 net/core/sock.c:2850
 llc_ui_release+0xc8/0x220 net/llc/af_llc.c:204

llc->sap is refcount'ed and llc_sap_remove_socket() is paired
with llc_sap_add_socket(). This can be amended by holding its refcount
before llc_sap_remove_socket() and releasing it after release_sock().

Reported-by: <syzbot+6e181fc95081c2cf9051@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/llc/af_llc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 01dcc0823d1f..6d29b2b94e84 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -189,6 +189,7 @@ static int llc_ui_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
 	struct llc_sock *llc;
+	struct llc_sap *sap;
 
 	if (unlikely(sk == NULL))
 		goto out;
@@ -199,9 +200,15 @@ static int llc_ui_release(struct socket *sock)
 		llc->laddr.lsap, llc->daddr.lsap);
 	if (!llc_send_disc(sk))
 		llc_ui_wait_for_disc(sk, sk->sk_rcvtimeo);
+	sap = llc->sap;
+	/* Hold this for release_sock(), so that llc_backlog_rcv() could still
+	 * use it.
+	 */
+	llc_sap_hold(sap);
 	if (!sock_flag(sk, SOCK_ZAPPED))
 		llc_sap_remove_socket(llc->sap, sk);
 	release_sock(sk);
+	llc_sap_put(sap);
 	if (llc->dev)
 		dev_put(llc->dev);
 	sock_put(sk);
-- 
2.13.0

^ permalink raw reply related

* sendmmsg flags userspace ABI change in kernel 4.6
From: Florian Weimer @ 2018-04-18 19:03 UTC (permalink / raw)
  To: linux-api, netdev, linux-sctp; +Cc: linux-kernel, Tom Herbert, mjw

Since this commit:

commit 28a94d8fb35b3a75b802f368ae6f4a9f6b0d435a
Author: Tom Herbert <tom@herbertland.com>
Date:   Mon Mar 7 14:11:02 2016 -0800

    net: Allow MSG_EOR in each msghdr of sendmmsg

    This patch allows setting MSG_EOR in each individual msghdr passed
    in sendmmsg. This allows a sendmmsg to send multiple messages when
    using SOCK_SEQPACKET.

    Signed-off-by: Tom Herbert <tom@herbertland.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

the msg_flags argument in individual msghdr arguments is longer
completely ignored for SOCK_SEQPACKET sockets.  msg_flags was and is
still documented as ignored for sendmsg(2), so by analogy for
sendmmsg(2) as well.

It seems that valgrind does not know about this yet, and due to
limited use of SCTP, this userspace ABI change has not been noticed so
far.

What are the plans in this area?  Will other kinds of sockets start
using the msghdr flags for sending?

A fully backwards-compatibility way to achieve this would be to
specify that you have to pass a new flag to sendmmsg (MSG_PERHDR?), in
its flags argument, to activate the per-msghdr flags.

The glibc DNS stub resolver relies on the previously documented
behavior, and I wonder how widely we should backport the change:

  https://sourceware.org/bugzilla/show_bug.cgi?id=23037

If the MSG_PERHDR route will be taken, we can skip this work, and
valgrind can flag uninitialized bits in msg_flags only if MSG_PERHDR
is passed.  (I believe it would be difficult for valgrind to look at
the socket type to determine whether undefined bits need reporting.)

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-18 19:13 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <dd0d53f0-f5da-cb7d-f8f6-d0c8245eb3cf@intel.com>

Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > This provides a generic interface for paravirtual drivers to listen
>> > > > for netdev register/unregister/link change events from pci ethernet
>> > > > devices with the same MAC and takeover their datapath. The notifier and
>> > > > event handling code is based on the existing netvsc implementation.
>> > > > 
>> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > > > master netdev is created. The paravirtual driver registers each bypass
>> > > > instance along with a set of ops to manage the slave events.
>> > > >       bypass_master_register()
>> > > >       bypass_master_unregister()
>> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > > > the bypass module provides interfaces to create/destroy additional master
>> > > > netdev and all the slave events are managed internally.
>> > > >        bypass_master_create()
>> > > >        bypass_master_destroy()
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > ---
>> > > > include/linux/netdevice.h |  14 +
>> > > > include/net/bypass.h      |  96 ++++++
>> > > > net/Kconfig               |  18 +
>> > > > net/core/Makefile         |   1 +
>> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> > > > 5 files changed, 973 insertions(+)
>> > > > create mode 100644 include/net/bypass.h
>> > > > create mode 100644 net/core/bypass.c
>> > > > 
>> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > > > index cf44503ea81a..587293728f70 100644
>> > > > --- a/include/linux/netdevice.h
>> > > > +++ b/include/linux/netdevice.h
>> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> > > > 	IFF_MACSEC			= 1<<25,
>> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> > > > +	IFF_BYPASS			= 1 << 27,
>> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> > > I wonder, why you don't follow the existing coding style... Also, please
>> > > add these to into the comment above.
>> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> > to the existing coding style to be consistent.
>> Please do.
>> 
>> 
>> > > 
>> > > > };
>> > > > 
>> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> > > > #define IFF_MACSEC			IFF_MACSEC
>> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> > > > +#define IFF_BYPASS			IFF_BYPASS
>> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> > > > 
>> > > > /**
>> > > >    *	struct net_device - The DEVICE structure.
>> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > > > }
>> > > > 
>> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS;
>> > > > +}
>> > > > +
>> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > > > +}
>> > > > +
>> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> > > > static inline void netif_keep_dst(struct net_device *dev)
>> > > > {
>> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > > > new file mode 100644
>> > > > index 000000000000..86b02cb894cf
>> > > > --- /dev/null
>> > > > +++ b/include/net/bypass.h
>> > > > @@ -0,0 +1,96 @@
>> > > > +// SPDX-License-Identifier: GPL-2.0
>> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> > > > +
>> > > > +#ifndef _NET_BYPASS_H
>> > > > +#define _NET_BYPASS_H
>> > > > +
>> > > > +#include <linux/netdevice.h>
>> > > > +
>> > > > +struct bypass_ops {
>> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> > > > +				  struct net_device *bypass_netdev);
>> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> > > > +			  struct net_device *bypass_netdev);
>> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> > > > +				    struct net_device *bypass_netdev);
>> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> > > > +			     struct net_device *bypass_netdev);
>> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> > > > +				 struct net_device *bypass_netdev);
>> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> > > > +};
>> > > > +
>> > > > +struct bypass_master {
>> > > > +	struct list_head list;
>> > > > +	struct net_device __rcu *bypass_netdev;
>> > > > +	struct bypass_ops __rcu *ops;
>> > > > +};
>> > > > +
>> > > > +/* bypass state */
>> > > > +struct bypass_info {
>> > > > +	/* passthru netdev with same MAC */
>> > > > +	struct net_device __rcu *active_netdev;
>> > > You still use "active"/"backup" names which is highly misleading as
>> > > it has completely different meaning that in bond for example.
>> > > I noted that in my previous review already. Please change it.
>> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> > matches with the BACKUP feature bit we are adding to virtio_net.
>> I think that "backup" is also misleading. Both "active" and "backup"
>> mean a *state* of slaves. This should be named differently.
>> 
>> 
>> 
>> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > am not too happy with it.
>> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> No. The netdev could be any netdevice. It does not have to be a "VF".
>> I think "stolen" is quite appropriate since it describes the modus
>> operandi. The bypass master steals some netdevice according to some
>> match.
>> 
>> But I don't insist on "stolen". Just sounds right.
>
>We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>'backup' name is consistent.

It perhaps makes sense from the view of virtio device. However, as I
described couple of times, for master/slave device the name "backup" is
highly misleading.


>
>The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>
>Will look for any suggestions in the next day or two. If i don't get any, i will go
>with 'stolen'
>
><snip>
>
>
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > As I wrote the last time, you don't need this list, spinlock.
>> > > You can do just something like:
>> > >           for_each_net(net) {
>> > >                   for_each_netdev(net, dev) {
>> > > 			if (netif_is_bypass_master(dev)) {
>> > This function returns the upper netdev as well as the ops associated
>> > with that netdev.
>> > bypass_master_list is a list of 'struct bypass_master' that associates
>> Well, can't you have it in netdev priv?
>
>We cannot do this for 2-netdev model as there is no bypass_netdev created.

Howcome? You have no master? I don't understand..



>
>> 
>> 
>> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> > NULL for 3-netdev model.
>> I see :(
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > 
>> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> > > I don't see how rcu_dereference is ok here.
>> > > 1) I don't see rcu_read_lock taken
>> > > 2) Looks like bypass_master->ops has the same value across the whole
>> > >      existence.
>> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> > Yes. ops doesn't change.
>> If it does not change, you can just access it directly.
>> 
>> 
>> > > 
>> > > > +			spin_unlock(&bypass_lock);
>> > > > +			return bypass_netdev;
>> > > > +		}
>> > > > +	}
>> > > > +	spin_unlock(&bypass_lock);
>> > > > +	return NULL;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret, orig_mtu;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > For master, could you use word "master" in the variables so it is clear?
>> > > Also, "dev" is fine instead of "netdev".
>> > > Something like "bpmaster_dev"
>> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> I was trying to point out, that "bypass_netdev" represents a "master"
>> netdev, yet it does not say master. That is why I suggested
>> "bpmaster_dev"
>> 
>> 
>> > I can change all _netdev suffixes to _dev to make the names shorter.
>> ok.
>> 
>> 
>> > 
>> > > 
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> > > > +					bypass_ops);
>> > > > +	if (ret != 0)
>> > > 	Just "if (ret)" will do. You have this on more places.
>> > OK.
>> > 
>> > 
>> > > 
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> > > > +					 bypass_handle_frame, bypass_netdev);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> > > > +			   ret);
>> > > > +		goto done;
>> > > > +	}
>> > > > +
>> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> > > > +			   bypass_netdev->name, ret);
>> > > > +		goto upper_link_failed;
>> > > > +	}
>> > > > +
>> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	if (netif_running(bypass_netdev)) {
>> > > > +		ret = dev_open(slave_netdev);
>> > > > +		if (ret && (ret != -EBUSY)) {
>> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> > > > +				   slave_netdev->name, ret);
>> > > > +			goto err_interface_up;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	/* Align MTU of slave with master */
>> > > > +	orig_mtu = slave_netdev->mtu;
>> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> > > > +		goto err_set_mtu;
>> > > > +	}
>> > > > +
>> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto err_join;
>> > > > +
>> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	goto done;
>> > > > +
>> > > > +err_join:
>> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> > > > +err_set_mtu:
>> > > > +	dev_close(slave_netdev);
>> > > > +err_interface_up:
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +upper_link_failed:
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> > > > +				       struct net_device *bypass_netdev,
>> > > > +				       struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_pre_unregister)
>> > > > +			return -EINVAL;
>> > > > +
>> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> > > > +							bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		return -EINVAL;
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> > > > +				struct net_device *bypass_netdev,
>> > > > +				struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_release)
>> > > > +			return -EINVAL;
>> > > I think it would be good to make the API to the driver more strict and
>> > > have a separate set of ops for "active" and "backup" netdevices.
>> > > That should stop people thinking about extending this to more slaves in
>> > > the future.
>> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> > 'active' slave.
>> I'm very well aware of that. I just thought that explicit ops for the
>> two slaves would make this more clear.
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > > +
>> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev == backup_netdev) {
>> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> > > > +	} else {
>> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> > > > +		if (backup_netdev) {
>> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	dev_put(slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> > > > +					  bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto done;
>> > > > +
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> > > > +
>> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> > > > +{
>> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_link_change)
>> > > > +			goto done;
>> > > > +
>> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> > > > +						     bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	if (!netif_running(bypass_netdev))
>> > > > +		return 0;
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		goto done;
>> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> > > above is enough.
>> > I think we need this check to not allow events from a slave that is not
>> > attached to this master but has the same MAC.
>> Why do we need such events? Seems wrong to me.
>
>We want to avoid events from a netdev that is mis-configured with the same MAC as
>a bypass setup.
>
>>   Consider:
>> 
>> bp1      bp2
>> a1 b1    a2 b2
>> 
>> 
>> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>
>We should not have 2 bypass configs with the same MAC.
>I need to add a check in the bypass_master_register() to prevent this.

Mac can change, you would have to check in change as well. Feels odd
thought. 


>
>The above check is to avoid cases where we have
>bp1(a1, b1) with mac1
>and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>
>> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> the order of creation.
>> Let's say it will return bp1. Then when we have event for a2, the
>> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> 
>> 
>> You cannot use bypass_master_get_bymac() here.
>> 
>> 
>> 
>> > > 
>> > > > +
>> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> > > > +		netif_carrier_on(bypass_netdev);
>> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> > > > +	} else {
>> > > > +		netif_carrier_off(bypass_netdev);
>> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> > > > +{
>> > > > +	/* Skip parent events */
>> > > > +	if (netif_is_bypass_master(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid non-Ethernet type devices */
>> > > > +	if (dev->type != ARPHRD_ETHER)
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> > > > +	if (is_vlan_dev(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > helpers netif_is_bond_master().
>> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > > 
>> > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > to whitelist VF devices. My port flavours patchset might help with this.
>> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> I don't see such function in the code.
>
>It is netdev_has_any_lower_dev(). I need to export it.

Come on, you cannot use that. That would allow bonding without slaves,
but the slaves could be added later on.

What exactly you are trying to achieve by this?


>
>> 
>> 
>> > device is not an upper dev.
>> > Can you point to your port flavours patchset? Is it upstream?
>> I sent rfc couple of weeks ago:
>> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>
>
>

^ permalink raw reply

* Re: [PATCH net-next] team: account for oper state
From: Jiri Pirko @ 2018-04-18 19:17 UTC (permalink / raw)
  To: George Wilkie; +Cc: netdev
In-Reply-To: <20180418153312.24h7mvle6sy2dv25@debian9.gwilkie>

Wed, Apr 18, 2018 at 05:33:12PM CEST, gwilkie@vyatta.att-mail.com wrote:
>On Wed, Apr 18, 2018 at 04:58:22PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 03:35:49PM CEST, gwilkie@vyatta.att-mail.com wrote:
>> >On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
>> >> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwilkie@vyatta.att-mail.com wrote:
>> >> >Account for operational state when determining port linkup state,
>> >> >as per Documentation/networking/operstates.txt.
>> >> 
>> >> Could you please point me to the exact place in the document where this
>> >> is suggested?
>> >> 
>> >
>> >Various places cover it I think.
>> >
>> >In 1. Introduction:
>> >"interface is not usable just because the admin enabled it"
>> >"userspace must be granted the possibility to
>> >influence operational state"
>> >
>> >In 4. Setting from userspace:
>> >"the userspace application can set IFLA_OPERSTATE
>> >to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
>> >netif_carrier_off() or netif_dormant_on()"
>> >
>> >We have a use case where we want to set the oper state of the team ports based
>> >on whether they are actually usable or not (as opposed to just admin up).
>> 
>> Are you running a supplicant there or what is the use-case?
>> 
>
>We are using tun/tap interfaces for the team ports with the physical interfaces
>under the control of a user process.
>
>> How is this handle in other drivers like bond, openvswitch, bridge, etc?
>
>It looks like bridge is using it, looking at br_port_carrier_check() and
>br_add_if().

Okay, so why do you still need to check netif_carrier_ok?
Looks like netif_oper_up is enough, right?


>
>Cheers.
>
>> 
>> >
>> >Cheers.
>> >
>> >> 
>> >> >
>> >> >Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
>> >> >---
>> >> > drivers/net/team/team.c | 3 ++-
>> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >
>> >> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>> >> >index a6c6ce19eeee..231264a05e55 100644
>> >> >--- a/drivers/net/team/team.c
>> >> >+++ b/drivers/net/team/team.c
>> >> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
>> >> > 	case NETDEV_CHANGE:
>> >> > 		if (netif_running(port->dev))
>> >> > 			team_port_change_check(port,
>> >> >-					       !!netif_carrier_ok(port->dev));
>> >> >+					       !!(netif_carrier_ok(port->dev) &&
>> >> >+						  netif_oper_up(port->dev)));
>> >> > 		break;
>> >> > 	case NETDEV_UNREGISTER:
>> >> > 		team_del_slave(port->team->dev, dev);
>> >> >-- 
>> >> >2.11.0
>> >> >

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Eric Dumazet @ 2018-04-18 19:21 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller
In-Reply-To: <1524071712.2599.60.camel@redhat.com>

On 04/18/2018 10:15 AM, Paolo Abeni wrote:
is not appealing to me :/
> 
> Thank you for the feedback.
> Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
> the above tests are leveraging it.
> 
> That 5% is on top of that 300%.

Then there is something wrong.

Adding copies should not increase performance.

If it does, there is certainly another way, reaching 10% instead of 5%

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox