Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH V6 5/5] net: netvsc: Add Isolation VM support for netvsc driver
From: Michael Kelley (LINUX) @ 2021-12-09 19:54 UTC (permalink / raw)
  To: Haiyang Zhang, Tianyu Lan, KY Srinivasan, Stephen Hemminger,
	wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, davem@davemloft.net,
	kuba@kernel.org, jejb@linux.ibm.com, martin.petersen@oracle.com,
	arnd@arndb.de, hch@infradead.org, m.szyprowski@samsung.com,
	robin.murphy@arm.com, Tianyu Lan, thomas.lendacky@amd.com
  Cc: iommu@lists.linux-foundation.org, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-scsi@vger.kernel.org, netdev@vger.kernel.org, vkuznets,
	brijesh.singh@amd.com, konrad.wilk@oracle.com, hch@lst.de,
	joro@8bytes.org, parri.andrea@gmail.com, dave.hansen@intel.com
In-Reply-To: <BN8PR21MB128401EEDE6B8C8553CC8009CA6F9@BN8PR21MB1284.namprd21.prod.outlook.com>

From: Haiyang Zhang <haiyangz@microsoft.com> Sent: Wednesday, December 8, 2021 12:14 PM
> > From: Tianyu Lan <ltykernel@gmail.com>
> > Sent: Tuesday, December 7, 2021 2:56 AM

[snip]

> >  static inline int netvsc_send_pkt(
> >  	struct hv_device *device,
> >  	struct hv_netvsc_packet *packet,
> > @@ -986,14 +1105,24 @@ static inline int netvsc_send_pkt(
> >
> >  	trace_nvsp_send_pkt(ndev, out_channel, rpkt);
> >
> > +	packet->dma_range = NULL;
> >  	if (packet->page_buf_cnt) {
> >  		if (packet->cp_partial)
> >  			pb += packet->rmsg_pgcnt;
> >
> > +		ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
> > +		if (ret) {
> > +			ret = -EAGAIN;
> > +			goto exit;
> > +		}
> 
> Returning EAGAIN will let the upper network layer busy retry,
> which may make things worse.
> I suggest to return ENOSPC here like another place in this
> function, which will just drop the packet, and let the network
> protocol/app layer decide how to recover.
> 
> Thanks,
> - Haiyang

I made the original suggestion to return -EAGAIN here.   A
DMA mapping failure should occur only if swiotlb bounce
buffer space is unavailable, which is a transient condition.
The existing code already stops the queue and returns
-EAGAIN when the ring buffer is full, which is also a transient
condition.  My sense is that the two conditions should be
handled the same way.  Or is there a reason why a ring
buffer full condition should stop the queue and retry, while
a mapping failure should drop the packet?

Michael

^ permalink raw reply

* Re: [PATCH net-next] xfrm: use net device refcount tracker helpers
From: Jakub Kicinski @ 2021-12-09 19:53 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Eric Dumazet, David S . Miller, netdev, Eric Dumazet, Cong Wang
In-Reply-To: <20211209092036.GK427717@gauss3.secunet.de>

On Thu, 9 Dec 2021 10:20:36 +0100 Steffen Klassert wrote:
> On Tue, Dec 07, 2021 at 11:32:03AM -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > release dev before returning error")
> > 
> > Fixes: 9038c320001d ("net: dst: add net device refcount tracking to dst_entry")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Cong Wang <amwang@redhat.com>
> > Cc: Steffen Klassert <steffen.klassert@secunet.com>  
> 
> As the refcount tracking infrastructure is not yet in ipsec-next:
> 
> Acked-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied, thanks!


^ permalink raw reply

* RE: [PATCH bpf-next 6/8] bpf: Add XDP_REDIRECT support to XDP for bpf_prog_run()
From: Toke Høiland-Jørgensen @ 2021-12-09 19:49 UTC (permalink / raw)
  To: John Fastabend, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer
  Cc: netdev, bpf
In-Reply-To: <61b25147bc136_6bfb208c5@john.notmuch>

John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>> 
>> > John Fastabend <john.fastabend@gmail.com> writes:
>> >
>> >> Toke Høiland-Jørgensen wrote:
>> >>> This adds support for doing real redirects when an XDP program returns
>> >>> XDP_REDIRECT in bpf_prog_run(). To achieve this, we create a page pool
>> >>> instance while setting up the test run, and feed pages from that into the
>> >>> XDP program. The setup cost of this is amortised over the number of
>> >>> repetitions specified by userspace.
>> >>> 
>> >>> To support performance testing use case, we further optimise the setup step
>> >>> so that all pages in the pool are pre-initialised with the packet data, and
>> >>> pre-computed context and xdp_frame objects stored at the start of each
>> >>> page. This makes it possible to entirely avoid touching the page content on
>> >>> each XDP program invocation, and enables sending up to 11.5 Mpps/core on my
>> >>> test box.
>> >>> 
>> >>> Because the data pages are recycled by the page pool, and the test runner
>> >>> doesn't re-initialise them for each run, subsequent invocations of the XDP
>> >>> program will see the packet data in the state it was after the last time it
>> >>> ran on that particular page. This means that an XDP program that modifies
>> >>> the packet before redirecting it has to be careful about which assumptions
>> >>> it makes about the packet content, but that is only an issue for the most
>> >>> naively written programs.
>> >>> 
>> >>> Previous uses of bpf_prog_run() for XDP returned the modified packet data
>> >>> and return code to userspace, which is a different semantic then this new
>> >>> redirect mode. For this reason, the caller has to set the new
>> >>> BPF_F_TEST_XDP_DO_REDIRECT flag when calling bpf_prog_run() to opt in to
>> >>> the different semantics. Enabling this flag is only allowed if not setting
>> >>> ctx_out and data_out in the test specification, since it means frames will
>> >>> be redirected somewhere else, so they can't be returned.
>> >>> 
>> >>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> >>> ---
>> >>
>> >> [...]
>> >>
>> >>> +static int bpf_test_run_xdp_redirect(struct bpf_test_timer *t,
>> >>> +				     struct bpf_prog *prog, struct xdp_buff *orig_ctx)
>> >>> +{
>> >>> +	void *data, *data_end, *data_meta;
>> >>> +	struct xdp_frame *frm;
>> >>> +	struct xdp_buff *ctx;
>> >>> +	struct page *page;
>> >>> +	int ret, err = 0;
>> >>> +
>> >>> +	page = page_pool_dev_alloc_pages(t->xdp.pp);
>> >>> +	if (!page)
>> >>> +		return -ENOMEM;
>> >>> +
>> >>> +	ctx = ctx_from_page(page);
>> >>> +	data = ctx->data;
>> >>> +	data_meta = ctx->data_meta;
>> >>> +	data_end = ctx->data_end;
>> >>> +
>> >>> +	ret = bpf_prog_run_xdp(prog, ctx);
>> >>> +	if (ret == XDP_REDIRECT) {
>> >>> +		frm = (struct xdp_frame *)(ctx + 1);
>> >>> +		/* if program changed pkt bounds we need to update the xdp_frame */
>> >>
>> >> Because this reuses the frame repeatedly is there any issue with also
>> >> updating the ctx each time? Perhaps if the prog keeps shrinking
>> >> the pkt it might wind up with 0 len pkt? Just wanted to ask.
>> >
>> > Sure, it could. But the data buffer comes from userspace anyway, and
>> > there's nothing preventing userspace from passing a 0-length packet
>> > anyway, so I just mentally put this in the "don't do that, then" bucket :)
>> >
>> > At least I don't *think* there's actually any problem with this that we
>> > don't have already? A regular XDP program can also shrink an incoming
>> > packet to zero, then redirect it, no?
>> 
>> Another thought is that we could of course do the opposite here: instead
>> of updating the xdp_frame when the program resizes the packet, just
>> reset the pointers so that the next invocation will get the original
>> size again? The data would still be changed, but maybe that behaviour is
>> less surprising? WDYT?
>
> Should read my email from newest to oldest :)
>
> I think resetting it back to the original size is less surprising. And
> if I want to benchmark a helper that moves the pointers it will be
> easier. For example benchmarking shrinking a packet with current
> code wouldn't really work because eventually the packet will be 0
> and my test will stop doing what I expect.

Ah yes, good point!

> Lets do the reset back to original size.

Alright, will do; thanks! :)

-Toke


^ permalink raw reply

* Re: [GIT PULL] Networking for 5.16-rc5
From: pr-tracker-bot @ 2021-12-09 19:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: torvalds, kuba, davem, netdev, bpf, linux-kernel, netfilter-devel,
	linux-can
In-Reply-To: <20211209172032.610738-1-kuba@kernel.org>

The pull request you sent on Thu,  9 Dec 2021 09:20:32 -0800:

> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-5.16-rc5

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ded746bfc94398d2ee9de315a187677b207b2004

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH] bpf: return EOPNOTSUPP when JIT is needed and not possible
From: Ido Schimmel @ 2021-12-09 19:31 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thadeu Lima de Souza Cascardo, bpf, netdev, ast, daniel,
	linux-kernel
In-Reply-To: <61b2536e5161d_6bfb2089@john.notmuch>

On Thu, Dec 09, 2021 at 11:05:18AM -0800, John Fastabend wrote:
> Thadeu Lima de Souza Cascardo wrote:
> > When a CBPF program is JITed and CONFIG_BPF_JIT_ALWAYS_ON is enabled, and
> > the JIT fails, it would return ENOTSUPP, which is not a valid userspace
> > error code.  Instead, EOPNOTSUPP should be returned.
> > 
> > Fixes: 290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
> > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> > ---
> >  kernel/bpf/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index de3e5bc6781f..5c89bae0d6f9 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -1931,7 +1931,7 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
> >  		fp = bpf_int_jit_compile(fp);
> >  		bpf_prog_jit_attempt_done(fp);
> >  		if (!fp->jited && jit_needed) {
> > -			*err = -ENOTSUPP;
> > +			*err = -EOPNOTSUPP;
> >  			return fp;
> >  		}
> >  	} else {
> > -- 
> > 2.32.0
> > 
> 
> It seems BPF subsys returns ENOTSUPP in multiple places. This fixes one
> paticular case and is user facing. Not sure we want to one-off fix them
> here creating user facing changes over multiple kernel versions. On the
> fence with this one curious to see what others think. Haven't apps
> already adapted to the current convention or they don't care?

Similar issue was discussed in the past. See:
https://lore.kernel.org/netdev/20191204.125135.750458923752225025.davem@davemloft.net/

^ permalink raw reply

* RE: [PATCH bpf-next 0/2] Introduce TCP_ULP option for bpf_{set,get}sockopt
From: John Fastabend @ 2021-12-09 19:27 UTC (permalink / raw)
  To: Tony Lu, ast, daniel, andrii; +Cc: bpf, netdev
In-Reply-To: <20211209090250.73927-1-tonylu@linux.alibaba.com>

Tony Lu wrote:
> This patch set introduces a new option TCP_ULP for bpf_{set,get}sockopt
> helper. The bpf prog can set and get TCP_ULP sock option on demand.
> 
> With this, the bpf prog can set TCP_ULP based on strategies when socket
> create or other's socket hook point. For example, the bpf prog can
> control which socket should use tls or smc (WIP) ULP modules without
> modifying the applications.
> 
> Patch 1 replaces if statement with switch to make it easy to extend.
> 
> Patch 2 introduces TCP_ULP sock option.

Can you be a bit more specific on what ULP you are going to load on
demand here and how that would work? For TLS I can't see how this will
work, please elaborate. Because the user space side (e.g. openssl) behaves
differently if running in kTLS vs uTLS modes I don't think you can
from kernel side just flip it on? I'm a bit intrigued though on what
might happen if we do did do this on an active socket, but seems it
wouldn't be normal TLS with handshake and keys at that point? I'm
not sure we need to block it from happening, but struggling to see
how its useful at the moment.

The smc case looks promising, but for that we need to get the order
correct and merge smc first and then this series.

Also this will need a selftests.

Thanks,
John

^ permalink raw reply

* RE: [PATCH] bpf: return EOPNOTSUPP when JIT is needed and not possible
From: John Fastabend @ 2021-12-09 19:05 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo, bpf; +Cc: netdev, ast, daniel, linux-kernel
In-Reply-To: <20211209134038.41388-1-cascardo@canonical.com>

Thadeu Lima de Souza Cascardo wrote:
> When a CBPF program is JITed and CONFIG_BPF_JIT_ALWAYS_ON is enabled, and
> the JIT fails, it would return ENOTSUPP, which is not a valid userspace
> error code.  Instead, EOPNOTSUPP should be returned.
> 
> Fixes: 290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> ---
>  kernel/bpf/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index de3e5bc6781f..5c89bae0d6f9 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1931,7 +1931,7 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
>  		fp = bpf_int_jit_compile(fp);
>  		bpf_prog_jit_attempt_done(fp);
>  		if (!fp->jited && jit_needed) {
> -			*err = -ENOTSUPP;
> +			*err = -EOPNOTSUPP;
>  			return fp;
>  		}
>  	} else {
> -- 
> 2.32.0
> 

It seems BPF subsys returns ENOTSUPP in multiple places. This fixes one
paticular case and is user facing. Not sure we want to one-off fix them
here creating user facing changes over multiple kernel versions. On the
fence with this one curious to see what others think. Haven't apps
already adapted to the current convention or they don't care?

.John

^ permalink raw reply

* [syzbot] KMSAN: uninit-value in fib_get_nhs
From: syzbot @ 2021-12-09 18:57 UTC (permalink / raw)
  To: davem, dsahern, glider, kuba, linux-kernel, netdev,
	syzkaller-bugs, yoshfuji

Hello,

syzbot found the following issue on:

HEAD commit:    8b936c96768e kmsan: core: remove the accidentally committe..
git tree:       https://github.com/google/kmsan.git master
console output: https://syzkaller.appspot.com/x/log.txt?x=1724ebc5b00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=e00a8959fdd3f3e8
dashboard link: https://syzkaller.appspot.com/bug?extid=d4b9a2851cc3ce998741
compiler:       clang version 14.0.0 (git@github.com:llvm/llvm-project.git 0996585c8e3b3d409494eb5f1cad714b9e1f7fb5), GNU ld (GNU Binutils for Debian) 2.35.2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1225f875b00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=139513c5b00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+d4b9a2851cc3ce998741@syzkaller.appspotmail.com

=====================================================
BUG: KMSAN: uninit-value in fib_get_nhs+0xac4/0x1f80 net/ipv4/fib_semantics.c:708
 fib_get_nhs+0xac4/0x1f80 net/ipv4/fib_semantics.c:708
 fib_create_info+0x2411/0x4870 net/ipv4/fib_semantics.c:1453
 fib_table_insert+0x45c/0x3a10 net/ipv4/fib_trie.c:1224
 inet_rtm_newroute+0x289/0x420 net/ipv4/fib_frontend.c:886
 rtnetlink_rcv_msg+0x145d/0x18c0 net/core/rtnetlink.c:5571
 netlink_rcv_skb+0x447/0x800 net/netlink/af_netlink.c:2491
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5589
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x1095/0x1360 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x16f3/0x1870 net/netlink/af_netlink.c:1916
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg net/socket.c:724 [inline]
 ____sys_sendmsg+0xe11/0x12c0 net/socket.c:2409
 ___sys_sendmsg net/socket.c:2463 [inline]
 __sys_sendmsg+0x4a5/0x640 net/socket.c:2492
 __do_sys_sendmsg net/socket.c:2501 [inline]
 __se_sys_sendmsg net/socket.c:2499 [inline]
 __x64_sys_sendmsg+0xe2/0x120 net/socket.c:2499
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Uninit was created at:
 slab_post_alloc_hook mm/slab.h:524 [inline]
 slab_alloc_node mm/slub.c:3251 [inline]
 __kmalloc_node_track_caller+0xe0c/0x1510 mm/slub.c:4974
 kmalloc_reserve net/core/skbuff.c:354 [inline]
 __alloc_skb+0x545/0xf90 net/core/skbuff.c:426
 alloc_skb include/linux/skbuff.h:1126 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1191 [inline]
 netlink_sendmsg+0xe93/0x1870 net/netlink/af_netlink.c:1891
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg net/socket.c:724 [inline]
 ____sys_sendmsg+0xe11/0x12c0 net/socket.c:2409
 ___sys_sendmsg net/socket.c:2463 [inline]
 __sys_sendmsg+0x4a5/0x640 net/socket.c:2492
 __do_sys_sendmsg net/socket.c:2501 [inline]
 __se_sys_sendmsg net/socket.c:2499 [inline]
 __x64_sys_sendmsg+0xe2/0x120 net/socket.c:2499
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

CPU: 0 PID: 6371 Comm: syz-executor193 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
=====================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* RE: [PATCH bpf-next 6/8] bpf: Add XDP_REDIRECT support to XDP for bpf_prog_run()
From: John Fastabend @ 2021-12-09 18:56 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: netdev, bpf
In-Reply-To: <87r1alwwk4.fsf@toke.dk>

Toke Høiland-Jørgensen wrote:
> Toke Høiland-Jørgensen <toke@redhat.com> writes:
> 
> > John Fastabend <john.fastabend@gmail.com> writes:
> >
> >> Toke Høiland-Jørgensen wrote:
> >>> This adds support for doing real redirects when an XDP program returns
> >>> XDP_REDIRECT in bpf_prog_run(). To achieve this, we create a page pool
> >>> instance while setting up the test run, and feed pages from that into the
> >>> XDP program. The setup cost of this is amortised over the number of
> >>> repetitions specified by userspace.
> >>> 
> >>> To support performance testing use case, we further optimise the setup step
> >>> so that all pages in the pool are pre-initialised with the packet data, and
> >>> pre-computed context and xdp_frame objects stored at the start of each
> >>> page. This makes it possible to entirely avoid touching the page content on
> >>> each XDP program invocation, and enables sending up to 11.5 Mpps/core on my
> >>> test box.
> >>> 
> >>> Because the data pages are recycled by the page pool, and the test runner
> >>> doesn't re-initialise them for each run, subsequent invocations of the XDP
> >>> program will see the packet data in the state it was after the last time it
> >>> ran on that particular page. This means that an XDP program that modifies
> >>> the packet before redirecting it has to be careful about which assumptions
> >>> it makes about the packet content, but that is only an issue for the most
> >>> naively written programs.
> >>> 
> >>> Previous uses of bpf_prog_run() for XDP returned the modified packet data
> >>> and return code to userspace, which is a different semantic then this new
> >>> redirect mode. For this reason, the caller has to set the new
> >>> BPF_F_TEST_XDP_DO_REDIRECT flag when calling bpf_prog_run() to opt in to
> >>> the different semantics. Enabling this flag is only allowed if not setting
> >>> ctx_out and data_out in the test specification, since it means frames will
> >>> be redirected somewhere else, so they can't be returned.
> >>> 
> >>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >>> ---
> >>
> >> [...]
> >>
> >>> +static int bpf_test_run_xdp_redirect(struct bpf_test_timer *t,
> >>> +				     struct bpf_prog *prog, struct xdp_buff *orig_ctx)
> >>> +{
> >>> +	void *data, *data_end, *data_meta;
> >>> +	struct xdp_frame *frm;
> >>> +	struct xdp_buff *ctx;
> >>> +	struct page *page;
> >>> +	int ret, err = 0;
> >>> +
> >>> +	page = page_pool_dev_alloc_pages(t->xdp.pp);
> >>> +	if (!page)
> >>> +		return -ENOMEM;
> >>> +
> >>> +	ctx = ctx_from_page(page);
> >>> +	data = ctx->data;
> >>> +	data_meta = ctx->data_meta;
> >>> +	data_end = ctx->data_end;
> >>> +
> >>> +	ret = bpf_prog_run_xdp(prog, ctx);
> >>> +	if (ret == XDP_REDIRECT) {
> >>> +		frm = (struct xdp_frame *)(ctx + 1);
> >>> +		/* if program changed pkt bounds we need to update the xdp_frame */
> >>
> >> Because this reuses the frame repeatedly is there any issue with also
> >> updating the ctx each time? Perhaps if the prog keeps shrinking
> >> the pkt it might wind up with 0 len pkt? Just wanted to ask.
> >
> > Sure, it could. But the data buffer comes from userspace anyway, and
> > there's nothing preventing userspace from passing a 0-length packet
> > anyway, so I just mentally put this in the "don't do that, then" bucket :)
> >
> > At least I don't *think* there's actually any problem with this that we
> > don't have already? A regular XDP program can also shrink an incoming
> > packet to zero, then redirect it, no?
> 
> Another thought is that we could of course do the opposite here: instead
> of updating the xdp_frame when the program resizes the packet, just
> reset the pointers so that the next invocation will get the original
> size again? The data would still be changed, but maybe that behaviour is
> less surprising? WDYT?

Should read my email from newest to oldest :)

I think resetting it back to the original size is less surprising. And
if I want to benchmark a helper that moves the pointers it will be
easier. For example benchmarking shrinking a packet with current
code wouldn't really work because eventually the packet will be 0
and my test will stop doing what I expect.

Lets do the reset back to original size.

Thanks,
John

> 
> -Toke
> 



^ permalink raw reply

* RE: [PATCH bpf-next 6/8] bpf: Add XDP_REDIRECT support to XDP for bpf_prog_run()
From: John Fastabend @ 2021-12-09 18:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: netdev, bpf
In-Reply-To: <87tufhwygr.fsf@toke.dk>

Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
> > Toke Høiland-Jørgensen wrote:
> >> This adds support for doing real redirects when an XDP program returns
> >> XDP_REDIRECT in bpf_prog_run(). To achieve this, we create a page pool
> >> instance while setting up the test run, and feed pages from that into the
> >> XDP program. The setup cost of this is amortised over the number of
> >> repetitions specified by userspace.
> >> 
> >> To support performance testing use case, we further optimise the setup step
> >> so that all pages in the pool are pre-initialised with the packet data, and
> >> pre-computed context and xdp_frame objects stored at the start of each
> >> page. This makes it possible to entirely avoid touching the page content on
> >> each XDP program invocation, and enables sending up to 11.5 Mpps/core on my
> >> test box.
> >> 
> >> Because the data pages are recycled by the page pool, and the test runner
> >> doesn't re-initialise them for each run, subsequent invocations of the XDP
> >> program will see the packet data in the state it was after the last time it
> >> ran on that particular page. This means that an XDP program that modifies
> >> the packet before redirecting it has to be careful about which assumptions
> >> it makes about the packet content, but that is only an issue for the most
> >> naively written programs.
> >> 
> >> Previous uses of bpf_prog_run() for XDP returned the modified packet data
> >> and return code to userspace, which is a different semantic then this new
> >> redirect mode. For this reason, the caller has to set the new
> >> BPF_F_TEST_XDP_DO_REDIRECT flag when calling bpf_prog_run() to opt in to
> >> the different semantics. Enabling this flag is only allowed if not setting
> >> ctx_out and data_out in the test specification, since it means frames will
> >> be redirected somewhere else, so they can't be returned.
> >> 
> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >> ---
> >
> > [...]
> >
> >> +static int bpf_test_run_xdp_redirect(struct bpf_test_timer *t,
> >> +				     struct bpf_prog *prog, struct xdp_buff *orig_ctx)
> >> +{
> >> +	void *data, *data_end, *data_meta;
> >> +	struct xdp_frame *frm;
> >> +	struct xdp_buff *ctx;
> >> +	struct page *page;
> >> +	int ret, err = 0;
> >> +
> >> +	page = page_pool_dev_alloc_pages(t->xdp.pp);
> >> +	if (!page)
> >> +		return -ENOMEM;
> >> +
> >> +	ctx = ctx_from_page(page);
> >> +	data = ctx->data;
> >> +	data_meta = ctx->data_meta;
> >> +	data_end = ctx->data_end;
> >> +
> >> +	ret = bpf_prog_run_xdp(prog, ctx);
> >> +	if (ret == XDP_REDIRECT) {
> >> +		frm = (struct xdp_frame *)(ctx + 1);
> >> +		/* if program changed pkt bounds we need to update the xdp_frame */
> >
> > Because this reuses the frame repeatedly is there any issue with also
> > updating the ctx each time? Perhaps if the prog keeps shrinking
> > the pkt it might wind up with 0 len pkt? Just wanted to ask.
> 
> Sure, it could. But the data buffer comes from userspace anyway, and
> there's nothing preventing userspace from passing a 0-length packet
> anyway, so I just mentally put this in the "don't do that, then" bucket :)
> 
> At least I don't *think* there's actually any problem with this that we
> don't have already? A regular XDP program can also shrink an incoming
> packet to zero, then redirect it, no?
> 
> -Toke
> 

Agree, I don't see any real issue with it. Just wnated to be sure we
thought through it.

Thanks!
John

^ permalink raw reply

* Re: [PATCH v4 net-next 2/9] i40e: respect metadata on XSK Rx to skb
From: Nguyen, Anthony L @ 2021-12-09 18:50 UTC (permalink / raw)
  To: Lobakin, Alexandr, jbrouer@redhat.com
  Cc: songliubraving@fb.com, hawk@kernel.org, kafai@fb.com,
	andrii@kernel.org, davem@davemloft.net, Fijalkowski, Maciej,
	john.fastabend@gmail.com, Brandeburg, Jesse, kpsingh@kernel.org,
	ast@kernel.org, linux-kernel@vger.kernel.org, brouer@redhat.com,
	yhs@fb.com, Joseph, Jithu, intel-wired-lan@lists.osuosl.org,
	kuba@kernel.org, bjorn@kernel.org, daniel@iogearbox.net,
	netdev@vger.kernel.org, michal.swiatkowski@linux.intel.com,
	bpf@vger.kernel.org
In-Reply-To: <20211209173816.5157-1-alexandr.lobakin@intel.com>

On Thu, 2021-12-09 at 18:38 +0100, Alexander Lobakin wrote:
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Date: Thu, 9 Dec 2021 09:27:37 +0100
> 
> > On 08/12/2021 15.06, Alexander Lobakin wrote:
> > > For now, if the XDP prog returns XDP_PASS on XSK, the metadata
> > > will
> > > be lost as it doesn't get copied to the skb.
> > 
> > I have an urge to add a newline here, when reading this, as IMHO it
> > is a 
> > paragraph with the problem statement.
> > 
> > > Copy it along with the frame headers. Account its size on skb
> > > allocation, and when copying just treat it as a part of the frame
> > > and do a pull after to "move" it to the "reserved" zone.
> > 
> > Also newline here, as next paragraph are some extra details, you
> > felt a 
> > need to explain to the reader.
> > 
> > > net_prefetch() xdp->data_meta and align the copy size to speed-up
> > > memcpy() a little and better match i40e_costruct_skb().
> >                                       ^^^^^^xx^^^^^^^^^
> > 
> > commit messages.
> 
> Oh gosh, I thought I don't have attention deficit. Thanks, maybe
> Tony will fix it for me or I could send a follow-up (or resend if
> needed, I saw those were already applied to dev-queue).

If there's no need for follow-ups beyond this change, I'll fix it up.

Thanks,
Tony

> > --Jesper
> 
> Al


^ permalink raw reply

* [PATCH net] inet_diag: fix kernel-infoleak for UDP sockets
From: Eric Dumazet @ 2021-12-09 18:50 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Eric Dumazet, syzbot

From: Eric Dumazet <edumazet@google.com>

KMSAN reported a kernel-infoleak [1], that can exploited
by unpriv users.

After analysis it turned out UDP was not initializing
r->idiag_expires. Other users of inet_sk_diag_fill()
might make the same mistake in the future, so fix this
in inet_sk_diag_fill().

[1]
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:121 [inline]
BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:156 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x69d/0x25c0 lib/iov_iter.c:670
 instrument_copy_to_user include/linux/instrumented.h:121 [inline]
 copyout lib/iov_iter.c:156 [inline]
 _copy_to_iter+0x69d/0x25c0 lib/iov_iter.c:670
 copy_to_iter include/linux/uio.h:155 [inline]
 simple_copy_to_iter+0xf3/0x140 net/core/datagram.c:519
 __skb_datagram_iter+0x2cb/0x1280 net/core/datagram.c:425
 skb_copy_datagram_iter+0xdc/0x270 net/core/datagram.c:533
 skb_copy_datagram_msg include/linux/skbuff.h:3657 [inline]
 netlink_recvmsg+0x660/0x1c60 net/netlink/af_netlink.c:1974
 sock_recvmsg_nosec net/socket.c:944 [inline]
 sock_recvmsg net/socket.c:962 [inline]
 sock_read_iter+0x5a9/0x630 net/socket.c:1035
 call_read_iter include/linux/fs.h:2156 [inline]
 new_sync_read fs/read_write.c:400 [inline]
 vfs_read+0x1631/0x1980 fs/read_write.c:481
 ksys_read+0x28c/0x520 fs/read_write.c:619
 __do_sys_read fs/read_write.c:629 [inline]
 __se_sys_read fs/read_write.c:627 [inline]
 __x64_sys_read+0xdb/0x120 fs/read_write.c:627
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Uninit was created at:
 slab_post_alloc_hook mm/slab.h:524 [inline]
 slab_alloc_node mm/slub.c:3251 [inline]
 __kmalloc_node_track_caller+0xe0c/0x1510 mm/slub.c:4974
 kmalloc_reserve net/core/skbuff.c:354 [inline]
 __alloc_skb+0x545/0xf90 net/core/skbuff.c:426
 alloc_skb include/linux/skbuff.h:1126 [inline]
 netlink_dump+0x3d5/0x16a0 net/netlink/af_netlink.c:2245
 __netlink_dump_start+0xd1c/0xee0 net/netlink/af_netlink.c:2370
 netlink_dump_start include/linux/netlink.h:254 [inline]
 inet_diag_handler_cmd+0x2e7/0x400 net/ipv4/inet_diag.c:1343
 sock_diag_rcv_msg+0x24a/0x620
 netlink_rcv_skb+0x447/0x800 net/netlink/af_netlink.c:2491
 sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:276
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x1095/0x1360 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x16f3/0x1870 net/netlink/af_netlink.c:1916
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg net/socket.c:724 [inline]
 sock_write_iter+0x594/0x690 net/socket.c:1057
 do_iter_readv_writev+0xa7f/0xc70
 do_iter_write+0x52c/0x1500 fs/read_write.c:851
 vfs_writev fs/read_write.c:924 [inline]
 do_writev+0x63f/0xe30 fs/read_write.c:967
 __do_sys_writev fs/read_write.c:1040 [inline]
 __se_sys_writev fs/read_write.c:1037 [inline]
 __x64_sys_writev+0xe5/0x120 fs/read_write.c:1037
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Bytes 68-71 of 312 are uninitialized
Memory access of size 312 starts at ffff88812ab54000
Data copied to user address 0000000020001440

CPU: 1 PID: 6365 Comm: syz-executor801 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 3c4d05c80567 ("inet_diag: Introduce the inet socket dumping routine")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/ipv4/inet_diag.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index c8fa6e7f7d1241691e048c868878b388e04b6f80..581b5b2d72a5bcc9c1ddf2b3664ef9ff669f63a5 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -261,6 +261,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 	r->idiag_state = sk->sk_state;
 	r->idiag_timer = 0;
 	r->idiag_retrans = 0;
+	r->idiag_expires = 0;
 
 	if (inet_diag_msg_attrs_fill(sk, skb, r, ext,
 				     sk_user_ns(NETLINK_CB(cb->skb).sk),
@@ -314,9 +315,6 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 		r->idiag_retrans = icsk->icsk_probes_out;
 		r->idiag_expires =
 			jiffies_delta_to_msecs(sk->sk_timer.expires - jiffies);
-	} else {
-		r->idiag_timer = 0;
-		r->idiag_expires = 0;
 	}
 
 	if ((ext & (1 << (INET_DIAG_INFO - 1))) && handler->idiag_info_size) {
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related

* Re: [RFC PATCH net-next 0/7] DSA master state tracking
From: Ansuel Smith @ 2021-12-09 18:34 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <20211209175720.cpifrbibghapm7eo@skbuf>

[-- Attachment #1: Type: text/plain, Size: 901 bytes --]

On Thu, Dec 09, 2021 at 05:57:20PM +0000, Vladimir Oltean wrote:
> On Thu, Dec 09, 2021 at 06:44:38PM +0100, Ansuel Smith wrote:
> > Ok will send a pcap. Any preferred way to send it?
> 
> Email attachment should be fine.

This is a pcap done on the cpu port eth0.
All the packet with lenght 60 are the Ethernet mdio packet.
The capture is done with no cable connected so don't know why there are
some broadcast for ipv6.
Special care about the fact that the qca tag is always present in the
EtherType.
Mdio request qca tag 8181
mdio ack qca tag b887

(I should really investigate the extra packet... In theory they are not
sent (and i remember checking this) by the tagger as they have bit 5:4
set to 0x2... From Documentation these bit are reserved and in our code
we always set them to zero. The switch by itself sends broadcast tagged
packet and respond itself?)

I attached the pcap.
-- 
	Ansuel

[-- Attachment #2: pkts.pcap --]
[-- Type: application/vnd.tcpdump.pcap, Size: 29944 bytes --]

^ permalink raw reply

* Re: [RFC PATCH net-next 0/7] DSA master state tracking
From: Ansuel Smith @ 2021-12-09 18:16 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <20211209175617.652rdiidc6pfgdwz@skbuf>

On Thu, Dec 09, 2021 at 05:56:17PM +0000, Vladimir Oltean wrote:
> On Thu, Dec 09, 2021 at 06:44:38PM +0100, Ansuel Smith wrote:
> > > I think the problem is that we also need to track the operstate of the
> > > master (netif_oper_up via NETDEV_CHANGE) before declaring it as good to go.
> > > You can see that this is exactly the line after which the timeouts disappear:
> > > 
> > > [    7.146901] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > > 
> > > I didn't really want to go there, because now I'm not sure how to
> > > synthesize the information for the switch drivers to consume it.
> > > Anyway I've prepared a v2 patchset and I'll send it out very soon.
> > 
> > Wonder if we should leave the driver decide when it's ready by parsing
> > the different state? (And change
> > the up ops to something like a generic change?)
> 
> There isn't just one state to track, which is precisely the problem that
> I had to deal with for v2. The master is operational during the time
> frame between NETDEV_UP and NETDEV_GOING_DOWN, intersected with the
> interval during which netif_oper_up(master) is true. So in the simple
> state propagation approach, DSA would need to provide at least two ops
> to switches, one for admin state and the other for oper state. And the
> switch driver would need to AND the two and keep state by itself.
> Letting the driver make the decision would have been acceptable to me if
> we could have 3 ops and a common implementation, something like this:
> 
> static void qca8k_master_state_change(struct dsa_switch *ds,
> 				      const struct dsa_master *master)
> {
> 	bool operational = (master->flags & IFF_UP) && netif_oper_up(master);
> }
> 
> 	.master_admin_state_change	= qca8k_master_state_change,
> 	.master_oper_state_change	= qca8k_master_state_change,
> 
> but the problem is that during NETDEV_GOING_DOWN, master->flags & IFF_UP
> is still true, so this wouldn't work. And replacing the NETDEV_GOING_DOWN
> notifier with the NETDEV_DOWN one would solve that problem, but it would
> no longer guarantee that the switch can disable this feature without
> timeouts before the master is down - because now it _is_ down.

Ok will have to test v2 and check if this is also fixed.

-- 
	Ansuel

^ permalink raw reply

* Re: [RFC PATCH net-next 0/7] DSA master state tracking
From: Vladimir Oltean @ 2021-12-09 17:57 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <61b24089.1c69fb81.c8040.164b@mx.google.com>

On Thu, Dec 09, 2021 at 06:44:38PM +0100, Ansuel Smith wrote:
> Ok will send a pcap. Any preferred way to send it?

Email attachment should be fine.

^ permalink raw reply

* Re: [RFC PATCH net-next 0/7] DSA master state tracking
From: Vladimir Oltean @ 2021-12-09 17:56 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <61b24089.1c69fb81.c8040.164b@mx.google.com>

On Thu, Dec 09, 2021 at 06:44:38PM +0100, Ansuel Smith wrote:
> > I think the problem is that we also need to track the operstate of the
> > master (netif_oper_up via NETDEV_CHANGE) before declaring it as good to go.
> > You can see that this is exactly the line after which the timeouts disappear:
> > 
> > [    7.146901] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > 
> > I didn't really want to go there, because now I'm not sure how to
> > synthesize the information for the switch drivers to consume it.
> > Anyway I've prepared a v2 patchset and I'll send it out very soon.
> 
> Wonder if we should leave the driver decide when it's ready by parsing
> the different state? (And change
> the up ops to something like a generic change?)

There isn't just one state to track, which is precisely the problem that
I had to deal with for v2. The master is operational during the time
frame between NETDEV_UP and NETDEV_GOING_DOWN, intersected with the
interval during which netif_oper_up(master) is true. So in the simple
state propagation approach, DSA would need to provide at least two ops
to switches, one for admin state and the other for oper state. And the
switch driver would need to AND the two and keep state by itself.
Letting the driver make the decision would have been acceptable to me if
we could have 3 ops and a common implementation, something like this:

static void qca8k_master_state_change(struct dsa_switch *ds,
				      const struct dsa_master *master)
{
	bool operational = (master->flags & IFF_UP) && netif_oper_up(master);
}

	.master_admin_state_change	= qca8k_master_state_change,
	.master_oper_state_change	= qca8k_master_state_change,

but the problem is that during NETDEV_GOING_DOWN, master->flags & IFF_UP
is still true, so this wouldn't work. And replacing the NETDEV_GOING_DOWN
notifier with the NETDEV_DOWN one would solve that problem, but it would
no longer guarantee that the switch can disable this feature without
timeouts before the master is down - because now it _is_ down.

^ permalink raw reply

* Re: [PATCH v1 bpf 1/1] libbpf: don't force user-supplied ifname string to be of fixed size
From: Emmanuel Deloget @ 2021-12-09 17:55 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Networking, bpf, open list
In-Reply-To: <CAEf4BzYJ+GPpjcMMYQM_BfQ1-aq6dz_JbF-m5meiCZ=oPbrM=w@mail.gmail.com>

Hello,

On 09/12/2021 18:17, Andrii Nakryiko wrote:
> On Thu, Dec 9, 2021 at 4:03 AM Emmanuel Deloget
> <emmanuel.deloget@eho.link> wrote:
>>
>> When calling either xsk_socket__create_shared() or xsk_socket__create()
>> the user supplies a const char *ifname which is implicitely supposed to
>> be a pointer to the start of a char[IFNAMSIZ] array. The internal
>> function xsk_create_ctx() then blindly copy IFNAMSIZ bytes from this
>> string into the xsk context.
>>
>> This is counter-intuitive and error-prone.
>>
>> For example,
>>
>>          int r = xsk_socket__create(..., "eth0", ...)
>>
>> may result in an invalid object because of the blind copy. The "eth0"
>> string might be followed by random data from the ro data section,
>> resulting in ctx->ifname being filled with the correct interface name
>> then a bunch and invalid bytes.
>>
>> The same kind of issue arises when the ifname string is located on the
>> stack:
>>
>>          char ifname[] = "eth0";
>>          int r = xsk_socket__create(..., ifname, ...);
>>
>> Or comes from the command line
>>
>>          const char *ifname = argv[n];
>>          int r = xsk_socket__create(..., ifname, ...);
>>
>> In both case we'll fill ctx->ifname with random data from the stack.
>>
>> In practice, we saw that this issue caused various small errors which,
>> in then end, prevented us to setup a valid xsk context that would have
>> allowed us to capture packets on our interfaces. We fixed this issue in
>> our code by forcing our char ifname[] to be of size IFNAMSIZ but that felt
>> weird and unnecessary.
> 
> I might be missing something, but the eth0 example above would include
> terminating zero at the right place, so ifname will still have
> "eth0\0" which is a valid string. Yes there will be some garbage after
> that, but it shouldn't matter. It could cause ASAN to complain about
> reading beyond allocated memory, of course, but I'm curious what
> problems you actually ran into in practice.

I cannot be extremely precise on what was happening as I did not 
investigate past this (and this fixes our issue) but I suspect that 
having weird bytes in ctx->ifname polutes ifr.ifr_name as initialized in 
xsk_get_max_queues(). ioctl(SIOCETHTOOL) was then giving us an error. 
Now, I haven't looked how the kernel implements this ioctl() so I'm not 
going to say that there is a problem here as well.

And since the issue is now about 2 weeks old it's now a bit murky - and 
I don't have much time to put myself in the same setup in order to 
produce a better investigation (sorry for that).

>>
>> Fixes: 2f6324a3937f8 (libbpf: Support shared umems between queues and devices)
>> Signed-off-by: Emmanuel Deloget <emmanuel.deloget@eho.link>
>> ---
>>   tools/lib/bpf/xsk.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
>> index 81f8fbc85e70..8dda80bcefcc 100644
>> --- a/tools/lib/bpf/xsk.c
>> +++ b/tools/lib/bpf/xsk.c
>> @@ -944,6 +944,7 @@ static struct xsk_ctx *xsk_create_ctx(struct xsk_socket *xsk,
>>   {
>>          struct xsk_ctx *ctx;
>>          int err;
>> +       size_t ifnamlen;
>>
>>          ctx = calloc(1, sizeof(*ctx));
>>          if (!ctx)
>> @@ -965,8 +966,10 @@ static struct xsk_ctx *xsk_create_ctx(struct xsk_socket *xsk,
>>          ctx->refcount = 1;
>>          ctx->umem = umem;
>>          ctx->queue_id = queue_id;
>> -       memcpy(ctx->ifname, ifname, IFNAMSIZ - 1);
>> -       ctx->ifname[IFNAMSIZ - 1] = '\0';
>> +
>> +       ifnamlen = strnlen(ifname, IFNAMSIZ);
>> +       memcpy(ctx->ifname, ifname, ifnamlen);
> 
> maybe use strncpy instead of strnlen + memcpy? keep the guaranteed
> zero termination (and keep '\0', why did you change it?)

Well, strncpy() calls were replaced by memcpy() a while ago (see 
3015b500ae42 (libbpf: Use memcpy instead of strncpy to please GCC) for 
example but there are a few other examples ; most of the changes were 
made to please gcc8) so I thought that it would be a bad idea :). What 
would be the consensus on this?

Regarding '\0', I'll change that.

> Also, note that xsk.c is deprecated in libbpf and has been moved into
> libxdp, so please contribute a similar fix there.

Will do.

>> +       ctx->ifname[IFNAMSIZ - 1] = 0;
>>
>>          ctx->fill = fill;
>>          ctx->comp = comp;
>> --
>> 2.32.0
>>

BTW, is there a reason why this patch failed to pass the bpf/vmtest-bpf 
test on patchwork?

Best regards,

-- Emmanuel Deloget

^ permalink raw reply

* Re: [RFC PATCH net-next 0/7] DSA master state tracking
From: Ansuel Smith @ 2021-12-09 17:44 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <20211209173316.qkm5kuroli7nmtwd@skbuf>

On Thu, Dec 09, 2021 at 05:33:17PM +0000, Vladimir Oltean wrote:
> On Thu, Dec 09, 2021 at 03:44:14PM +0100, Ansuel Smith wrote:
> > On Thu, Dec 09, 2021 at 02:28:30PM +0000, Vladimir Oltean wrote:
> > > On Thu, Dec 09, 2021 at 04:05:59AM +0100, Ansuel Smith wrote:
> > > > On Thu, Dec 09, 2021 at 12:32:23AM +0200, Vladimir Oltean wrote:
> > > > > This patch set is provided solely for review purposes (therefore not to
> > > > > be applied anywhere) and for Ansuel to test whether they resolve the
> > > > > slowdown reported here:
> > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20211207145942.7444-1-ansuelsmth@gmail.com/
> > > > > 
> > > > > It does conflict with net-next due to other patches that are in my tree,
> > > > > and which were also posted here and would need to be picked ("Rework DSA
> > > > > bridge TX forwarding offload API"):
> > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20211206165758.1553882-1-vladimir.oltean@nxp.com/
> > > > > 
> > > > > Additionally, for Ansuel's work there is also a logical dependency with
> > > > > this series ("Replace DSA dp->priv with tagger-owned storage"):
> > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20211208200504.3136642-1-vladimir.oltean@nxp.com/
> > > > > 
> > > > > To get both dependency series, the following commands should be sufficient:
> > > > > git b4 20211206165758.1553882-1-vladimir.oltean@nxp.com
> > > > > git b4 20211208200504.3136642-1-vladimir.oltean@nxp.com
> > > > > 
> > > > > where "git b4" is an alias in ~/.gitconfig:
> > > > > [b4]
> > > > > 	midmask = https://lore.kernel.org/r/%25s
> > > > > [alias]
> > > > > 	b4 = "!f() { b4 am -t -o - $@ | git am -3; }; f"
> > > > > 
> > > > > The patches posted here are mainly to offer a consistent
> > > > > "master_up"/"master_going_down" chain of events to switches, without
> > > > > duplicates, and always starting with "master_up" and ending with
> > > > > "master_going_down". This way, drivers should know when they can perform
> > > > > Ethernet-based register access.
> > > > > 
> > > > > Vladimir Oltean (7):
> > > > >   net: dsa: only bring down user ports assigned to a given DSA master
> > > > >   net: dsa: refactor the NETDEV_GOING_DOWN master tracking into separate
> > > > >     function
> > > > >   net: dsa: use dsa_tree_for_each_user_port in
> > > > >     dsa_tree_master_going_down()
> > > > >   net: dsa: provide switch operations for tracking the master state
> > > > >   net: dsa: stop updating master MTU from master.c
> > > > >   net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
> > > > >   net: dsa: replay master state events in
> > > > >     dsa_tree_{setup,teardown}_master
> > > > > 
> > > > >  include/net/dsa.h  |  8 +++++++
> > > > >  net/dsa/dsa2.c     | 52 ++++++++++++++++++++++++++++++++++++++++++++--
> > > > >  net/dsa/dsa_priv.h | 11 ++++++++++
> > > > >  net/dsa/master.c   | 29 +++-----------------------
> > > > >  net/dsa/slave.c    | 32 +++++++++++++++-------------
> > > > >  net/dsa/switch.c   | 29 ++++++++++++++++++++++++++
> > > > >  6 files changed, 118 insertions(+), 43 deletions(-)
> > > > > 
> > > > > -- 
> > > > > 2.25.1
> > > > > 
> > > > 
> > > > I applied this patch and it does work correctly. Sadly the problem is
> > > > not solved and still the packet are not tracked correctly. What I notice
> > > > is that everything starts to work as soon as the master is set to
> > > > promiiscuous mode. Wonder if we should track that event instead of
> > > > simple up?
> > > > 
> > > > Here is a bootlog [0]. I added some log when the function timeouts and when
> > > > master up is actually called.
> > > > Current implementation for this is just a bool that is set to true on
> > > > master up and false on master going down. (final version should use
> > > > locking to check if an Ethernet transation is in progress)
> > > > 
> > > > [0] https://pastebin.com/7w2kgG7a
> > > 
> > > This is strange. What MAC DA do the ack packets have? Could you give us
> > > a pcap with the request and reply packets (not necessarily now)?
> > 
> > If you want I can give you a pcap from a router bootup to the setup with
> > no ethernet cable attached. I notice the switch sends some packet at the
> > bootup for some reason but they are not Ethernet mdio packet or other
> > type. It seems they are not even tagged (doesn't have qca tag) as the
> > header mode is disabled by default)
> > Let me know if you need just a pcap for the Ethernet mdio transaction or
> > from a bootup. I assume it would be better from a bootup? (they are not
> > tons of packet and the mdio Ethernet ones are easy to notice.)
> 
> Anything that contains some request and response packets should do, as
> long as they're relatively easy to spot. But as stated, this can wait
> for a while, I don't think that promiscuity is the issue, after your
> second reply.
>

Ok will send a pcap. Any preferred way to send it?

> > > Can you try to set ".promisc_on_master = true" in qca_netdev_ops?
> > 
> > I already tried and here [0] is a log. I notice with promisc_on_master
> > the "eth0 entered promiscuous mode" is missing. Is that correct?
> > Unless I was tired and misread the code, the info should be printed
> > anyway. Also looking at the comments for promisc_on_master I don't think
> > that should be applied to this tagger.
> > 
> > [0] https://pastebin.com/MN2ttVpr
> 
> It isn't missing, it's right there on line 11.

Oww didn't notice that!

> I think the problem is that we also need to track the operstate of the
> master (netif_oper_up via NETDEV_CHANGE) before declaring it as good to go.
> You can see that this is exactly the line after which the timeouts disappear:
> 
> [    7.146901] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> 
> I didn't really want to go there, because now I'm not sure how to
> synthesize the information for the switch drivers to consume it.
> Anyway I've prepared a v2 patchset and I'll send it out very soon.

Wonder if we should leave the driver decide when it's ready by parsing
the different state? (And change
the up ops to something like a generic change?)

-- 
	Ansuel

^ permalink raw reply

* [RFC PATCH v2 net-next 2/4] net: dsa: stop updating master MTU from master.c
From: Vladimir Oltean @ 2021-12-09 17:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith
In-Reply-To: <20211209173927.4179375-1-vladimir.oltean@nxp.com>

The dev_set_mtu() call from dsa_master_setup() has been effectively
superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is
done from dsa_slave_create() for each user port. This function also
updates the master MTU according to the largest user port MTU from the
tree. Therefore, updating the master MTU through a separate code path
isn't needed.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/master.c | 25 +------------------------
 1 file changed, 1 insertion(+), 24 deletions(-)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index e8e19857621b..f4efb244f91d 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -330,28 +330,13 @@ static const struct attribute_group dsa_group = {
 	.attrs	= dsa_slave_attrs,
 };
 
-static void dsa_master_reset_mtu(struct net_device *dev)
-{
-	int err;
-
-	rtnl_lock();
-	err = dev_set_mtu(dev, ETH_DATA_LEN);
-	if (err)
-		netdev_dbg(dev,
-			   "Unable to reset MTU to exclude DSA overheads\n");
-	rtnl_unlock();
-}
-
 static struct lock_class_key dsa_master_addr_list_lock_key;
 
 int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
 {
-	const struct dsa_device_ops *tag_ops = cpu_dp->tag_ops;
 	struct dsa_switch *ds = cpu_dp->ds;
 	struct device_link *consumer_link;
-	int mtu, ret;
-
-	mtu = ETH_DATA_LEN + dsa_tag_protocol_overhead(tag_ops);
+	int ret;
 
 	/* The DSA master must use SET_NETDEV_DEV for this to work. */
 	consumer_link = device_link_add(ds->dev, dev->dev.parent,
@@ -361,13 +346,6 @@ int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
 			   "Failed to create a device link to DSA switch %s\n",
 			   dev_name(ds->dev));
 
-	rtnl_lock();
-	ret = dev_set_mtu(dev, mtu);
-	rtnl_unlock();
-	if (ret)
-		netdev_warn(dev, "error %d setting MTU to %d to include DSA overhead\n",
-			    ret, mtu);
-
 	/* If we use a tagging format that doesn't have an ethertype
 	 * field, make sure that all packets from this point on get
 	 * sent to the tag format's receive function.
@@ -405,7 +383,6 @@ void dsa_master_teardown(struct net_device *dev)
 	sysfs_remove_group(&dev->dev.kobj, &dsa_group);
 	dsa_netdev_ops_set(dev, NULL);
 	dsa_master_ethtool_teardown(dev);
-	dsa_master_reset_mtu(dev);
 	dsa_master_set_promiscuity(dev, -1);
 
 	dev->dsa_ptr = NULL;
-- 
2.25.1


^ permalink raw reply related

* [RFC PATCH v2 net-next 4/4] net: dsa: replay master state events in dsa_tree_{setup,teardown}_master
From: Vladimir Oltean @ 2021-12-09 17:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith
In-Reply-To: <20211209173927.4179375-1-vladimir.oltean@nxp.com>

In order for switch driver to be able to make simple and reliable use of
the master tracking operations, they must also be notified of the
initial state of the DSA master, not just of the changes. This is
because they might enable certain features only during the time when
they know that the DSA master is up and running.

Therefore, this change explicitly checks the state of the DSA master
under the same rtnl_mutex as we were holding during the
dsa_master_setup() and dsa_master_teardown() call. The idea being that
if the DSA master became operational in between the moment in which it
became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment
when we checked for master->flags & IFF_UP, there is a chance that we
would emit a ->master_up() event twice. We need to avoid that by
serializing the concurrent netdevice event with us. If the netdevice
event started before, we force it to finish before we begin, because we
take rtnl_lock before making netdev_uses_dsa() return true. So we also
handle that early event and do nothing on it. Similarly, if the
dev_open() attempt is concurrent with us, it will attempt to take the
rtnl_mutex, but we're holding it. We'll see that the master flag IFF_UP
isn't set, then when we release the rtnl_mutex we'll process the
NETDEV_UP notifier.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/dsa2.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 6d4422c9e334..c86c9688e8cc 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -1019,9 +1019,17 @@ static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
 
 	list_for_each_entry(dp, &dst->ports, list) {
 		if (dsa_port_is_cpu(dp)) {
-			err = dsa_master_setup(dp->master, dp);
+			struct net_device *master = dp->master;
+
+			err = dsa_master_setup(master, dp);
 			if (err)
 				return err;
+
+			/* Replay master state event */
+			dsa_tree_master_admin_state_change(dst, master,
+							   master->flags & IFF_UP);
+			dsa_tree_master_oper_state_change(dst, master,
+							  netif_oper_up(master));
 		}
 	}
 
@@ -1036,9 +1044,19 @@ static void dsa_tree_teardown_master(struct dsa_switch_tree *dst)
 
 	rtnl_lock();
 
-	list_for_each_entry(dp, &dst->ports, list)
-		if (dsa_port_is_cpu(dp))
-			dsa_master_teardown(dp->master);
+	list_for_each_entry(dp, &dst->ports, list) {
+		if (dsa_port_is_cpu(dp)) {
+			struct net_device *master = dp->master;
+
+			/* Synthesizing an "admin down" state is sufficient for
+			 * the switches to get a notification if the master is
+			 * currently up and running.
+			 */
+			dsa_tree_master_admin_state_change(dst, master, false);
+
+			dsa_master_teardown(master);
+		}
+	}
 
 	rtnl_unlock();
 }
-- 
2.25.1


^ permalink raw reply related

* [RFC PATCH v2 net-next 3/4] net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
From: Vladimir Oltean @ 2021-12-09 17:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith
In-Reply-To: <20211209173927.4179375-1-vladimir.oltean@nxp.com>

DSA needs to simulate master tracking events when a binding is first
with a DSA master established and torn down, in order to give drivers
the simplifying guarantee that ->master_up and ->master_going_down calls
are made in exactly this order. To avoid races, we need to block the
reception of NETDEV_UP/NETDEV_GOING_DOWN events in the netdev notifier
chain while we are changing the master's dev->dsa_ptr (this changes what
netdev_uses_dsa(dev) reports).

The dsa_master_setup() and dsa_master_teardown() functions optionally
require the rtnl_mutex to be held, if the tagger needs the master to be
promiscuous, these functions call dev_set_promiscuity(). Move the
rtnl_lock() from that function and make it top-level.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/dsa2.c   | 8 ++++++++
 net/dsa/master.c | 4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index a6cb3470face..6d4422c9e334 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -1015,6 +1015,8 @@ static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
 	struct dsa_port *dp;
 	int err;
 
+	rtnl_lock();
+
 	list_for_each_entry(dp, &dst->ports, list) {
 		if (dsa_port_is_cpu(dp)) {
 			err = dsa_master_setup(dp->master, dp);
@@ -1023,6 +1025,8 @@ static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
 		}
 	}
 
+	rtnl_unlock();
+
 	return 0;
 }
 
@@ -1030,9 +1034,13 @@ static void dsa_tree_teardown_master(struct dsa_switch_tree *dst)
 {
 	struct dsa_port *dp;
 
+	rtnl_lock();
+
 	list_for_each_entry(dp, &dst->ports, list)
 		if (dsa_port_is_cpu(dp))
 			dsa_master_teardown(dp->master);
+
+	rtnl_unlock();
 }
 
 static int dsa_tree_setup_lags(struct dsa_switch_tree *dst)
diff --git a/net/dsa/master.c b/net/dsa/master.c
index f4efb244f91d..2199104ca7df 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -267,9 +267,9 @@ static void dsa_master_set_promiscuity(struct net_device *dev, int inc)
 	if (!ops->promisc_on_master)
 		return;
 
-	rtnl_lock();
+	ASSERT_RTNL();
+
 	dev_set_promiscuity(dev, inc);
-	rtnl_unlock();
 }
 
 static ssize_t tagging_show(struct device *d, struct device_attribute *attr,
-- 
2.25.1


^ permalink raw reply related

* [RFC PATCH v2 net-next 1/4] net: dsa: provide switch operations for tracking the master state
From: Vladimir Oltean @ 2021-12-09 17:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith
In-Reply-To: <20211209173927.4179375-1-vladimir.oltean@nxp.com>

Certain drivers may need to send management traffic to the switch for
things like register access, FDB dump, etc, to accelerate what their
slow bus (SPI, I2C, MDIO) can already do.

Ethernet is faster (especially in bulk transactions) but is also more
unreliable, since the user may decide to bring the DSA master down (or
not bring it up), therefore severing the link between the host and the
attached switch.

Drivers needing Ethernet-based register access already should have
fallback logic to the slow bus if the Ethernet method fails, but that
fallback may be based on a timeout, and the I/O to the switch may slow
down to a halt if the master is down, because every Ethernet packet will
have to time out. The driver also doesn't have the option to turn off
Ethernet-based I/O momentarily, because it wouldn't know when to turn it
back on.

Which is where this change comes in. By tracking NETDEV_CHANGE,
NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know
the exact interval of time during which this interface is reliably
available for traffic. Provide this information to switches so they can
use it as they wish.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/dsa.h  | 11 +++++++++++
 net/dsa/dsa2.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 net/dsa/dsa_priv.h | 13 +++++++++++++
 net/dsa/slave.c    | 27 +++++++++++++++++++++++++++
 net/dsa/switch.c   | 15 +++++++++++++++
 5 files changed, 112 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index bdf308a5c55e..8690b9c6d674 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -296,6 +296,10 @@ struct dsa_port {
 	struct list_head	fdbs;
 	struct list_head	mdbs;
 
+	/* Master state bits, valid only on CPU ports */
+	u8 master_admin_up:1,
+	   master_oper_up:1;
+
 	bool setup;
 };
 
@@ -1011,6 +1015,13 @@ struct dsa_switch_ops {
 	int	(*tag_8021q_vlan_add)(struct dsa_switch *ds, int port, u16 vid,
 				      u16 flags);
 	int	(*tag_8021q_vlan_del)(struct dsa_switch *ds, int port, u16 vid);
+
+	/*
+	 * DSA master tracking operations
+	 */
+	void	(*master_state_change)(struct dsa_switch *ds,
+				       const struct net_device *master,
+				       bool operational);
 };
 
 #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 8814fa0e44c8..a6cb3470face 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -1187,6 +1187,52 @@ int dsa_tree_change_tag_proto(struct dsa_switch_tree *dst,
 	return err;
 }
 
+static void dsa_tree_master_state_change(struct dsa_switch_tree *dst,
+					 struct net_device *master)
+{
+	struct dsa_notifier_master_state_info info;
+	struct dsa_port *cpu_dp = master->dsa_ptr;
+
+	info.master = master;
+	info.operational = cpu_dp->master_admin_up && cpu_dp->master_oper_up;
+
+	dsa_tree_notify(dst, DSA_NOTIFIER_MASTER_STATE_CHANGE, &info);
+}
+
+void dsa_tree_master_admin_state_change(struct dsa_switch_tree *dst,
+					struct net_device *master,
+					bool up)
+{
+	struct dsa_port *cpu_dp = master->dsa_ptr;
+	bool notify = false;
+
+	if ((cpu_dp->master_admin_up && cpu_dp->master_oper_up) !=
+	    (up && cpu_dp->master_oper_up))
+		notify = true;
+
+	cpu_dp->master_admin_up = up;
+
+	if (notify)
+		dsa_tree_master_state_change(dst, master);
+}
+
+void dsa_tree_master_oper_state_change(struct dsa_switch_tree *dst,
+				       struct net_device *master,
+				       bool up)
+{
+	struct dsa_port *cpu_dp = master->dsa_ptr;
+	bool notify = false;
+
+	if ((cpu_dp->master_admin_up && cpu_dp->master_oper_up) !=
+	    (cpu_dp->master_admin_up && up))
+		notify = true;
+
+	cpu_dp->master_oper_up = up;
+
+	if (notify)
+		dsa_tree_master_state_change(dst, master);
+}
+
 static struct dsa_port *dsa_port_touch(struct dsa_switch *ds, int index)
 {
 	struct dsa_switch_tree *dst = ds->dst;
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 38ce5129a33d..c47864446adc 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -43,6 +43,7 @@ enum {
 	DSA_NOTIFIER_MRP_DEL_RING_ROLE,
 	DSA_NOTIFIER_TAG_8021Q_VLAN_ADD,
 	DSA_NOTIFIER_TAG_8021Q_VLAN_DEL,
+	DSA_NOTIFIER_MASTER_STATE_CHANGE,
 };
 
 /* DSA_NOTIFIER_AGEING_TIME */
@@ -126,6 +127,12 @@ struct dsa_notifier_tag_8021q_vlan_info {
 	u16 vid;
 };
 
+/* DSA_NOTIFIER_MASTER_STATE_CHANGE */
+struct dsa_notifier_master_state_info {
+	const struct net_device *master;
+	bool operational;
+};
+
 struct dsa_switchdev_event_work {
 	struct dsa_switch *ds;
 	int port;
@@ -506,6 +513,12 @@ int dsa_tree_change_tag_proto(struct dsa_switch_tree *dst,
 			      struct net_device *master,
 			      const struct dsa_device_ops *tag_ops,
 			      const struct dsa_device_ops *old_tag_ops);
+void dsa_tree_master_admin_state_change(struct dsa_switch_tree *dst,
+					struct net_device *master,
+					bool up);
+void dsa_tree_master_oper_state_change(struct dsa_switch_tree *dst,
+				       struct net_device *master,
+				       bool up);
 unsigned int dsa_bridge_num_get(const struct net_device *bridge_dev, int max);
 void dsa_bridge_num_put(const struct net_device *bridge_dev,
 			unsigned int bridge_num);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2b153b366118..9f3b25c08c13 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2349,6 +2349,31 @@ static int dsa_slave_netdevice_event(struct notifier_block *nb,
 		err = dsa_port_lag_change(dp, info->lower_state_info);
 		return notifier_from_errno(err);
 	}
+	case NETDEV_CHANGE: {
+		if (netdev_uses_dsa(dev)) {
+			struct dsa_port *cpu_dp = dev->dsa_ptr;
+			struct dsa_switch_tree *dst = cpu_dp->ds->dst;
+
+			dsa_tree_master_oper_state_change(dst, dev,
+							  netif_oper_up(dev));
+
+			return NOTIFY_OK;
+		}
+
+		return NOTIFY_DONE;
+	}
+	case NETDEV_UP: {
+		if (netdev_uses_dsa(dev)) {
+			struct dsa_port *cpu_dp = dev->dsa_ptr;
+			struct dsa_switch_tree *dst = cpu_dp->ds->dst;
+
+			dsa_tree_master_admin_state_change(dst, dev, true);
+
+			return NOTIFY_OK;
+		}
+
+		return NOTIFY_DONE;
+	}
 	case NETDEV_GOING_DOWN: {
 		struct dsa_port *dp, *cpu_dp;
 		struct dsa_switch_tree *dst;
@@ -2360,6 +2385,8 @@ static int dsa_slave_netdevice_event(struct notifier_block *nb,
 		cpu_dp = dev->dsa_ptr;
 		dst = cpu_dp->ds->dst;
 
+		dsa_tree_master_admin_state_change(dst, dev, false);
+
 		list_for_each_entry(dp, &dst->ports, list) {
 			if (!dsa_port_is_user(dp))
 				continue;
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 9c92edd96961..78816a6805c8 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -699,6 +699,18 @@ dsa_switch_mrp_del_ring_role(struct dsa_switch *ds,
 	return 0;
 }
 
+static int
+dsa_switch_master_state_change(struct dsa_switch *ds,
+			       struct dsa_notifier_master_state_info *info)
+{
+	if (!ds->ops->master_state_change)
+		return 0;
+
+	ds->ops->master_state_change(ds, info->master, info->operational);
+
+	return 0;
+}
+
 static int dsa_switch_event(struct notifier_block *nb,
 			    unsigned long event, void *info)
 {
@@ -784,6 +796,9 @@ static int dsa_switch_event(struct notifier_block *nb,
 	case DSA_NOTIFIER_TAG_8021Q_VLAN_DEL:
 		err = dsa_switch_tag_8021q_vlan_del(ds, info);
 		break;
+	case DSA_NOTIFIER_MASTER_STATE_CHANGE:
+		err = dsa_switch_master_state_change(ds, info);
+		break;
 	default:
 		err = -EOPNOTSUPP;
 		break;
-- 
2.25.1


^ permalink raw reply related

* [RFC PATCH v2 net-next 0/4] DSA master state tracking
From: Vladimir Oltean @ 2021-12-09 17:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith

This patch set is provided solely for review purposes (therefore not to
be applied anywhere) and for Ansuel to test whether they resolve the
slowdown reported here:
https://patchwork.kernel.org/project/netdevbpf/cover/20211207145942.7444-1-ansuelsmth@gmail.com/

The patches posted here are mainly to offer a consistent
"master_state_change" chain of events to switches, without duplicates,
and always starting with operational=true and ending with
operational=false. This way, drivers should know when they can perform
Ethernet-based register access, and need not care about more than that.

Changes in v2:
- dropped some useless patches
- also check master operstate.

Vladimir Oltean (4):
  net: dsa: provide switch operations for tracking the master state
  net: dsa: stop updating master MTU from master.c
  net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
  net: dsa: replay master state events in
    dsa_tree_{setup,teardown}_master

 include/net/dsa.h  | 11 +++++++
 net/dsa/dsa2.c     | 80 +++++++++++++++++++++++++++++++++++++++++++---
 net/dsa/dsa_priv.h | 13 ++++++++
 net/dsa/master.c   | 29 ++---------------
 net/dsa/slave.c    | 27 ++++++++++++++++
 net/dsa/switch.c   | 15 +++++++++
 6 files changed, 145 insertions(+), 30 deletions(-)

-- 
2.25.1


^ permalink raw reply

* Re: [PATCH v4 net-next 2/9] i40e: respect metadata on XSK Rx to skb
From: Alexander Lobakin @ 2021-12-09 17:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Lobakin, intel-wired-lan, brouer, Jesse Brandeburg,
	Tony Nguyen, David S. Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Maciej Fijalkowski, Michal Swiatkowski,
	Jithu Joseph, Martin KaFai Lau, Song Liu, KP Singh, Yonghong Song,
	Andrii Nakryiko, netdev, linux-kernel, bpf
In-Reply-To: <2811b35a-9179-88ce-d87a-e1f824851494@redhat.com>

From: Jesper Dangaard Brouer <jbrouer@redhat.com>
Date: Thu, 9 Dec 2021 09:27:37 +0100

> On 08/12/2021 15.06, Alexander Lobakin wrote:
> > For now, if the XDP prog returns XDP_PASS on XSK, the metadata will
> > be lost as it doesn't get copied to the skb.
> 
> I have an urge to add a newline here, when reading this, as IMHO it is a 
> paragraph with the problem statement.
> 
> > Copy it along with the frame headers. Account its size on skb
> > allocation, and when copying just treat it as a part of the frame
> > and do a pull after to "move" it to the "reserved" zone.
> 
> Also newline here, as next paragraph are some extra details, you felt a 
> need to explain to the reader.
> 
> > net_prefetch() xdp->data_meta and align the copy size to speed-up
> > memcpy() a little and better match i40e_costruct_skb().
>                                       ^^^^^^xx^^^^^^^^^
> 
> You have a general misspelling of this function name in all of your 
> commit messages.

Oh gosh, I thought I don't have attention deficit. Thanks, maybe
Tony will fix it for me or I could send a follow-up (or resend if
needed, I saw those were already applied to dev-queue).

> 
> --Jesper

Al

^ permalink raw reply

* [PATCH net-next v1] net: dsa: mv88e6xxx: Trap PTP traffic
From: Kurt Kanzenbach @ 2021-12-09 17:33 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot
  Cc: Florian Fainelli, Vladimir Oltean, David S. Miller,
	Jakub Kicinski, netdev, Richard Cochran, Kurt Kanzenbach

A time aware switch should trap PTP traffic to the management CPU. User space
daemons such as ptp4l will process these messages to implement Boundary (or
Transparent) Clocks.

At the moment the mv88e6xxx driver for mv88e6341 doesn't trap these messages
which leads to confusion when multiple end devices are connected to the
switch. Therefore, setup PTP traps. Leverage the already implemented policy
mechanism based on destination addresses. Configure the traps only if
timestamping is enabled so that non time aware use case is still functioning.

Introduce an additional PTP operation in case other devices need special
handling in regards to trapping as well.

Tested on Marvell Topaz (mv88e6341) switch with multiple end devices connected
like this:

|# DSA setup
|$ ip link set eth0 up
|$ ip link set lan0 up
|$ ip link set lan1 up
|$ ip link set lan2 up
|$ ip link add name br0 type bridge
|$ ip link set dev lan0 master br0
|$ ip link set dev lan1 master br0
|$ ip link set dev lan2 master br0
|$ ip link set lan0 up
|$ ip link set lan1 up
|$ ip link set lan2 up
|$ ip link set br0 up
|$ dhclient br0
|# Configure bridge routing
|$ ebtables --table broute --append BROUTING --protocol 0x88F7 --jump DROP
|# Start linuxptp
|$ ptp4l -H -2 -i lan0 -i lan1 -i lan2 --tx_timestamp_timeout=40 -m

Verified added policies with mv88e6xxx_dump.

Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de>
---
 drivers/net/dsa/mv88e6xxx/chip.c     | 12 +++---
 drivers/net/dsa/mv88e6xxx/chip.h     |  5 +++
 drivers/net/dsa/mv88e6xxx/hwtstamp.c |  7 ++++
 drivers/net/dsa/mv88e6xxx/ptp.c      | 59 ++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/ptp.h      |  2 +
 5 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 7fadbf987b23..ab50ebd85f1f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1816,8 +1816,8 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
 	return mv88e6xxx_g1_atu_loadpurge(chip, fid, &entry);
 }
 
-static int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
-				  const struct mv88e6xxx_policy *policy)
+int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
+			   const struct mv88e6xxx_policy *policy)
 {
 	enum mv88e6xxx_policy_mapping mapping = policy->mapping;
 	enum mv88e6xxx_policy_action action = policy->action;
@@ -1835,10 +1835,12 @@ static int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
 	case MV88E6XXX_POLICY_MAPPING_SA:
 		if (action == MV88E6XXX_POLICY_ACTION_NORMAL)
 			state = 0; /* Dissociate the port and address */
-		else if (action == MV88E6XXX_POLICY_ACTION_DISCARD &&
+		else if ((action == MV88E6XXX_POLICY_ACTION_DISCARD ||
+			  action == MV88E6XXX_POLICY_ACTION_TRAP) &&
 			 is_multicast_ether_addr(addr))
 			state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC_POLICY;
-		else if (action == MV88E6XXX_POLICY_ACTION_DISCARD &&
+		else if ((action == MV88E6XXX_POLICY_ACTION_DISCARD ||
+			  action == MV88E6XXX_POLICY_ACTION_TRAP) &&
 			 is_unicast_ether_addr(addr))
 			state = MV88E6XXX_G1_ATU_DATA_STATE_UC_STATIC_POLICY;
 		else
@@ -4589,7 +4591,7 @@ static const struct mv88e6xxx_ops mv88e6341_ops = {
 	.serdes_irq_status = mv88e6390_serdes_irq_status,
 	.gpio_ops = &mv88e6352_gpio_ops,
 	.avb_ops = &mv88e6390_avb_ops,
-	.ptp_ops = &mv88e6352_ptp_ops,
+	.ptp_ops = &mv88e6341_ptp_ops,
 	.serdes_get_sset_count = mv88e6390_serdes_get_sset_count,
 	.serdes_get_strings = mv88e6390_serdes_get_strings,
 	.serdes_get_stats = mv88e6390_serdes_get_stats,
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 8271b8aa7b71..795ae5a56834 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -673,6 +673,8 @@ struct mv88e6xxx_ptp_ops {
 	int (*port_disable)(struct mv88e6xxx_chip *chip, int port);
 	int (*global_enable)(struct mv88e6xxx_chip *chip);
 	int (*global_disable)(struct mv88e6xxx_chip *chip);
+	int (*setup_ptp_traps)(struct mv88e6xxx_chip *chip, int port,
+			       bool enable);
 	int n_ext_ts;
 	int arr0_sts_reg;
 	int arr1_sts_reg;
@@ -760,4 +762,7 @@ static inline void mv88e6xxx_reg_unlock(struct mv88e6xxx_chip *chip)
 
 int mv88e6xxx_fid_map(struct mv88e6xxx_chip *chip, unsigned long *bitmap);
 
+int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
+			   const struct mv88e6xxx_policy *policy);
+
 #endif /* _MV88E6XXX_CHIP_H */
diff --git a/drivers/net/dsa/mv88e6xxx/hwtstamp.c b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
index 8f74ffc7a279..617aeb6cbaac 100644
--- a/drivers/net/dsa/mv88e6xxx/hwtstamp.c
+++ b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
@@ -94,6 +94,7 @@ static int mv88e6xxx_set_hwtstamp_config(struct mv88e6xxx_chip *chip, int port,
 	const struct mv88e6xxx_ptp_ops *ptp_ops = chip->info->ops->ptp_ops;
 	struct mv88e6xxx_port_hwtstamp *ps = &chip->port_hwtstamp[port];
 	bool tstamp_enable = false;
+	int ret;
 
 	/* Prevent the TX/RX paths from trying to interact with the
 	 * timestamp hardware while we reconfigure it.
@@ -161,6 +162,12 @@ static int mv88e6xxx_set_hwtstamp_config(struct mv88e6xxx_chip *chip, int port,
 		if (chip->enable_count == 0 && ptp_ops->global_disable)
 			ptp_ops->global_disable(chip);
 	}
+
+	if (ptp_ops->setup_ptp_traps) {
+		ret = ptp_ops->setup_ptp_traps(chip, port, tstamp_enable);
+		if (tstamp_enable && ret)
+			dev_warn(chip->dev, "Failed to setup PTP traps. PTP might not work as desired!\n");
+	}
 	mv88e6xxx_reg_unlock(chip);
 
 	/* Once hardware has been configured, enable timestamp checks
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
index d838c174dc0d..8d6ff03d37c8 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.c
+++ b/drivers/net/dsa/mv88e6xxx/ptp.c
@@ -345,6 +345,37 @@ static int mv88e6352_ptp_verify(struct ptp_clock_info *ptp, unsigned int pin,
 	return 0;
 }
 
+static int mv88e6341_setup_ptp_traps(struct mv88e6xxx_chip *chip, int port,
+				     bool enable)
+{
+	static const u8 ptp_destinations[][ETH_ALEN] = {
+		{ 0x01, 0x1b, 0x19, 0x00, 0x00, 0x00 }, /* L2 PTP */
+		{ 0x01, 0x80, 0xc2, 0x00, 0x00, 0x0e }, /* L2 P2P */
+		{ 0x01, 0x00, 0x5e, 0x00, 0x01, 0x81 }, /* IPv4 PTP */
+		{ 0x01, 0x00, 0x5e, 0x00, 0x00, 0x6b }, /* IPv4 P2P */
+		{ 0x33, 0x33, 0x00, 0x00, 0x01, 0x81 }, /* IPv6 PTP */
+		{ 0x33, 0x33, 0x00, 0x00, 0x00, 0x6b }, /* IPv6 P2P */
+	};
+	int ret, i;
+
+	for (i = 0; i < ARRAY_SIZE(ptp_destinations); ++i) {
+		struct mv88e6xxx_policy policy = { };
+
+		policy.mapping	= MV88E6XXX_POLICY_MAPPING_DA;
+		policy.action	= enable ? MV88E6XXX_POLICY_ACTION_TRAP :
+			MV88E6XXX_POLICY_ACTION_NORMAL;
+		policy.port	= port;
+		policy.vid	= 0;
+		ether_addr_copy(policy.addr, ptp_destinations[i]);
+
+		ret = mv88e6xxx_policy_apply(chip, port, &policy);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 const struct mv88e6xxx_ptp_ops mv88e6165_ptp_ops = {
 	.clock_read = mv88e6165_ptp_clock_read,
 	.global_enable = mv88e6165_global_enable,
@@ -419,6 +450,34 @@ const struct mv88e6xxx_ptp_ops mv88e6352_ptp_ops = {
 	.cc_mult_dem = MV88E6XXX_CC_MULT_DEM,
 };
 
+const struct mv88e6xxx_ptp_ops mv88e6341_ptp_ops = {
+	.clock_read = mv88e6352_ptp_clock_read,
+	.ptp_enable = mv88e6352_ptp_enable,
+	.ptp_verify = mv88e6352_ptp_verify,
+	.event_work = mv88e6352_tai_event_work,
+	.port_enable = mv88e6352_hwtstamp_port_enable,
+	.port_disable = mv88e6352_hwtstamp_port_disable,
+	.setup_ptp_traps = mv88e6341_setup_ptp_traps,
+	.n_ext_ts = 1,
+	.arr0_sts_reg = MV88E6XXX_PORT_PTP_ARR0_STS,
+	.arr1_sts_reg = MV88E6XXX_PORT_PTP_ARR1_STS,
+	.dep_sts_reg = MV88E6XXX_PORT_PTP_DEP_STS,
+	.rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ),
+	.cc_shift = MV88E6XXX_CC_SHIFT,
+	.cc_mult = MV88E6XXX_CC_MULT,
+	.cc_mult_num = MV88E6XXX_CC_MULT_NUM,
+	.cc_mult_dem = MV88E6XXX_CC_MULT_DEM,
+};
+
 static u64 mv88e6xxx_ptp_clock_read(const struct cyclecounter *cc)
 {
 	struct mv88e6xxx_chip *chip = cc_to_chip(cc);
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.h b/drivers/net/dsa/mv88e6xxx/ptp.h
index 269d5d16a466..badcb72d10a6 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.h
+++ b/drivers/net/dsa/mv88e6xxx/ptp.h
@@ -151,6 +151,7 @@ void mv88e6xxx_ptp_free(struct mv88e6xxx_chip *chip);
 extern const struct mv88e6xxx_ptp_ops mv88e6165_ptp_ops;
 extern const struct mv88e6xxx_ptp_ops mv88e6250_ptp_ops;
 extern const struct mv88e6xxx_ptp_ops mv88e6352_ptp_ops;
+extern const struct mv88e6xxx_ptp_ops mv88e6341_ptp_ops;
 
 #else /* !CONFIG_NET_DSA_MV88E6XXX_PTP */
 
@@ -171,6 +172,7 @@ static inline void mv88e6xxx_ptp_free(struct mv88e6xxx_chip *chip)
 static const struct mv88e6xxx_ptp_ops mv88e6165_ptp_ops = {};
 static const struct mv88e6xxx_ptp_ops mv88e6250_ptp_ops = {};
 static const struct mv88e6xxx_ptp_ops mv88e6352_ptp_ops = {};
+static const struct mv88e6xxx_ptp_ops mv88e6341_ptp_ops = {};
 
 #endif /* CONFIG_NET_DSA_MV88E6XXX_PTP */
 
-- 
2.34.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox