Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [syzbot] KASAN: use-after-free Read in put_pmu_ctx
From: syzbot @ 2022-12-19 19:56 UTC (permalink / raw)
  To: acme, alexander.shishkin, bpf, jolsa, linux-kernel,
	linux-perf-users, mark.rutland, mingo, namhyung, netdev, peterz,
	sdf, syzkaller-bugs
In-Reply-To: <Y6C8iQGENUk/XY/A@google.com>

Hello,

syzbot tried to test the proposed patch but the build/boot failed:

failed to apply patch:
checking file kernel/events/core.c
patch: **** unexpected end of file in patch

Tested on:

commit:         13e3c779 Merge tag 'for-netdev' of https://git.kernel...
git tree:       https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
dashboard link: https://syzkaller.appspot.com/bug?extid=b8e8c01c8ade4fe6e48f
compiler:       
patch:          https://syzkaller.appspot.com/x/patch.diff?x=15861a9f880000

^ permalink raw reply

* Re: [syzbot] KASAN: use-after-free Read in put_pmu_ctx
From: sdf @ 2022-12-19 19:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: syzbot, acme, alexander.shishkin, bpf, jolsa, linux-kernel,
	linux-perf-users, mark.rutland, mingo, namhyung, netdev,
	syzkaller-bugs
In-Reply-To: <Y6B3xEgkbmFUCeni@hirez.programming.kicks-ass.net>

On 12/19, Peter Zijlstra wrote:
> On Mon, Dec 19, 2022 at 12:04:43AM -0800, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    13e3c7793e2f Merge tag 'for-netdev' of  
> https://git.kernel...
> > git tree:       bpf
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=177df7e0480000
> > kernel config:   
> https://syzkaller.appspot.com/x/.config?x=b0e91ad4b5f69c47
> > dashboard link:  
> https://syzkaller.appspot.com/bug?extid=b8e8c01c8ade4fe6e48f
> > compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU  
> Binutils for Debian) 2.35.2
> > syz repro:       
> https://syzkaller.appspot.com/x/repro.syz?x=15e87100480000
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16ceeb13880000
> >
> > Downloadable assets:
> > disk image:  
> https://storage.googleapis.com/syzbot-assets/373a99daa295/disk-13e3c779.raw.xz
> > vmlinux:  
> https://storage.googleapis.com/syzbot-assets/7fa71ed0fe17/vmlinux-13e3c779.xz
> > kernel image:  
> https://storage.googleapis.com/syzbot-assets/2842ad5c698b/bzImage-13e3c779.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the  
> commit:
> > Reported-by: syzbot+b8e8c01c8ade4fe6e48f@syzkaller.appspotmail.com
> >
> > ==================================================================
> > BUG: KASAN: use-after-free in __lock_acquire+0x3ee7/0x56d0  
> kernel/locking/lockdep.c:4925
> > Read of size 8 at addr ffff8880237d6018 by task syz-executor287/8300
> >
> > CPU: 0 PID: 8300 Comm: syz-executor287 Not tainted  
> 6.1.0-syzkaller-09661-g13e3c7793e2f #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 10/26/2022
> > Call Trace:
> >  <TASK>
> >  __dump_stack lib/dump_stack.c:88 [inline]
> >  dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106
> >  print_address_description mm/kasan/report.c:284 [inline]
> >  print_report+0x15e/0x45d mm/kasan/report.c:395
> >  kasan_report+0xbf/0x1f0 mm/kasan/report.c:495
> >  __lock_acquire+0x3ee7/0x56d0 kernel/locking/lockdep.c:4925
> >  lock_acquire kernel/locking/lockdep.c:5668 [inline]
> >  lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
> >  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
> >  _raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
> >  put_pmu_ctx kernel/events/core.c:4913 [inline]
> >  put_pmu_ctx+0xad/0x390 kernel/events/core.c:4893
> >  _free_event+0x3c5/0x13d0 kernel/events/core.c:5196
> >  free_event+0x58/0xc0 kernel/events/core.c:5224
> >  __do_sys_perf_event_open+0x66d/0x2980 kernel/events/core.c:12701
> >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd

> Does this help?

Let's maybe try it this way:

#syz test: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git  
13e3c7793e2f

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e47914ac8732..bbff551783e1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12689,7 +12689,8 @@ SYSCALL_DEFINE5(perf_event_open,
  	return event_fd;

  err_context:
-	/* event->pmu_ctx freed by free_event() */
+	put_pmu_ctx(event->pmu_ctx);
+	event->pmu_ctx = NULL; /* _free_event() */
  err_locked:
  	mutex_unlock(&ctx->mutex);
  	perf_unpin_context(ctx);

^ permalink raw reply related

* Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range
From: Mike Kravetz @ 2022-12-19 19:22 UTC (permalink / raw)
  To: Michal Hocko, Jérôme Glisse
  Cc: linux-mm, linux-kernel, linuxppc-dev, linux-riscv, linux-s390,
	netdev, David Hildenbrand, Peter Xu, Nadav Amit, Matthew Wilcox,
	Vlastimil Babka, Rik van Riel, Will Deacon, Michael Ellerman,
	Palmer Dabbelt, Christian Borntraeger, Dave Hansen,
	Christian Brauner, Eric Dumazet, Andrew Morton
In-Reply-To: <Y6A6KqXObGKxvDrX@dhcp22.suse.cz>

On 12/19/22 13:06, Michal Hocko wrote:
> On Fri 16-12-22 11:20:12, Mike Kravetz wrote:
> > zap_page_range was originally designed to unmap pages within an address
> > range that could span multiple vmas.  While working on [1], it was
> > discovered that all callers of zap_page_range pass a range entirely within
> > a single vma.  In addition, the mmu notification call within zap_page
> > range does not correctly handle ranges that span multiple vmas as calls
> > should be vma specific.
> 
> Could you spend a sentence or two explaining what is wrong here?

Hmmmm?  My assumption was that the range passed to mmu_notifier_range_init()
was supposed to be within the specified vma.  When looking into the notifier
routines, I could not find any documentation about the usage of the vma within
the mmu_notifier_range structure.  It was introduced with commit bf198b2b34bf
"mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening".
However, I do not see this being used today.

Of course, I could be missing something, so adding Jérôme.

> 
> > Instead of fixing zap_page_range, change all callers to use the new
> > routine zap_vma_page_range.  zap_vma_page_range is just a wrapper around
> > zap_page_range_single passing in NULL zap details.  The name is also
> > more in line with other exported routines that operate within a vma.
> > We can then remove zap_page_range.
> 
> I would stick with zap_page_range_single rather than adding a new
> wrapper but nothing really critical.

I am fine with doing that as well.  My only reason for the wrapper is that all 
callers outside mm/memory.c would pass in NULL zap details.

> 
> > Also, change madvise_dontneed_single_vma to use this new routine.
> > 
> > [1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Other than that LGTM
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Thanks!

Thanks for taking a look.
-- 
Mike Kravetz

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] selftests/bpf: Add BPF_F_NO_TUNNEL_KEY test
From: sdf @ 2022-12-19 18:41 UTC (permalink / raw)
  To: Christian Ehrig
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Mykola Lysenko, Shuah Khan,
	Joanne Koong, Kui-Feng Lee, Kumar Kartikeya Dwivedi,
	Maxim Mikityanskiy, Kaixi Fan, Paul Chaignon, Shmulik Ladkani,
	linux-kernel, netdev, linux-kselftest
In-Reply-To: <20221218051734.31411-2-cehrig@cloudflare.com>

On 12/18, Christian Ehrig wrote:
> This patch adds a selftest simulating a GRE sender and receiver using
> tunnel headers without tunnel keys. It validates if packets encapsulated
> using BPF_F_NO_TUNNEL_KEY are decapsulated by a GRE receiver not
> configured with tunnel keys.

> Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>

Acked-by: Stanislav Fomichev <sdf@google.com>

> ---
>   .../selftests/bpf/progs/test_tunnel_kern.c    | 21 ++++++++++
>   tools/testing/selftests/bpf/test_tunnel.sh    | 40 +++++++++++++++++--
>   2 files changed, 58 insertions(+), 3 deletions(-)

> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c  
> b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> index 98af55f0bcd3..508da4a23c4f 100644
> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> @@ -81,6 +81,27 @@ int gre_set_tunnel(struct __sk_buff *skb)
>   	return TC_ACT_OK;
>   }

> +SEC("tc")
> +int gre_set_tunnel_no_key(struct __sk_buff *skb)
> +{
> +	int ret;
> +	struct bpf_tunnel_key key;
> +
> +	__builtin_memset(&key, 0x0, sizeof(key));
> +	key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
> +	key.tunnel_ttl = 64;
> +
> +	ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key),
> +				     BPF_F_ZERO_CSUM_TX | BPF_F_SEQ_NUMBER |
> +				     BPF_F_NO_TUNNEL_KEY);
> +	if (ret < 0) {
> +		log_err(ret);
> +		return TC_ACT_SHOT;
> +	}
> +
> +	return TC_ACT_OK;
> +}
> +
>   SEC("tc")
>   int gre_get_tunnel(struct __sk_buff *skb)
>   {
> diff --git a/tools/testing/selftests/bpf/test_tunnel.sh  
> b/tools/testing/selftests/bpf/test_tunnel.sh
> index 2eaedc1d9ed3..06857b689c11 100755
> --- a/tools/testing/selftests/bpf/test_tunnel.sh
> +++ b/tools/testing/selftests/bpf/test_tunnel.sh
> @@ -66,15 +66,20 @@ config_device()

>   add_gre_tunnel()
>   {
> +	tun_key=
> +	if [ -n "$1" ]; then
> +		tun_key="key $1"
> +	fi
> +
>   	# at_ns0 namespace
>   	ip netns exec at_ns0 \
> -        ip link add dev $DEV_NS type $TYPE seq key 2 \
> +        ip link add dev $DEV_NS type $TYPE seq $tun_key \
>   		local 172.16.1.100 remote 172.16.1.200
>   	ip netns exec at_ns0 ip link set dev $DEV_NS up
>   	ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24

>   	# root namespace
> -	ip link add dev $DEV type $TYPE key 2 external
> +	ip link add dev $DEV type $TYPE $tun_key external
>   	ip link set dev $DEV up
>   	ip addr add dev $DEV 10.1.1.200/24
>   }
> @@ -238,7 +243,7 @@ test_gre()

>   	check $TYPE
>   	config_device
> -	add_gre_tunnel
> +	add_gre_tunnel 2
>   	attach_bpf $DEV gre_set_tunnel gre_get_tunnel
>   	ping $PING_ARG 10.1.1.100
>   	check_err $?
> @@ -253,6 +258,30 @@ test_gre()
>           echo -e ${GREEN}"PASS: $TYPE"${NC}
>   }

> +test_gre_no_tunnel_key()
> +{
> +	TYPE=gre
> +	DEV_NS=gre00
> +	DEV=gre11
> +	ret=0
> +
> +	check $TYPE
> +	config_device
> +	add_gre_tunnel
> +	attach_bpf $DEV gre_set_tunnel_no_key gre_get_tunnel
> +	ping $PING_ARG 10.1.1.100
> +	check_err $?
> +	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
> +	check_err $?
> +	cleanup
> +
> +        if [ $ret -ne 0 ]; then
> +                echo -e ${RED}"FAIL: $TYPE"${NC}
> +                return 1
> +        fi
> +        echo -e ${GREEN}"PASS: $TYPE"${NC}
> +}
> +
>   test_ip6gre()
>   {
>   	TYPE=ip6gre
> @@ -589,6 +618,7 @@ cleanup()
>   	ip link del ipip6tnl11 2> /dev/null
>   	ip link del ip6ip6tnl11 2> /dev/null
>   	ip link del gretap11 2> /dev/null
> +	ip link del gre11 2> /dev/null
>   	ip link del ip6gre11 2> /dev/null
>   	ip link del ip6gretap11 2> /dev/null
>   	ip link del geneve11 2> /dev/null
> @@ -641,6 +671,10 @@ bpf_tunnel_test()
>   	test_gre
>   	errors=$(( $errors + $? ))

> +	echo "Testing GRE tunnel (without tunnel keys)..."
> +	test_gre_no_tunnel_key
> +	errors=$(( $errors + $? ))
> +
>   	echo "Testing IP6GRE tunnel..."
>   	test_ip6gre
>   	errors=$(( $errors + $? ))
> --
> 2.37.4


^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: Add flag BPF_F_NO_TUNNEL_KEY to bpf_skb_set_tunnel_key()
From: sdf @ 2022-12-19 18:41 UTC (permalink / raw)
  To: Christian Ehrig
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Mykola Lysenko, Shuah Khan,
	Joanne Koong, Kui-Feng Lee, Maxim Mikityanskiy, Kaixi Fan,
	Shmulik Ladkani, Paul Chaignon, linux-kernel, netdev,
	linux-kselftest
In-Reply-To: <20221218051734.31411-1-cehrig@cloudflare.com>

On 12/18, Christian Ehrig wrote:
> This patch allows to remove TUNNEL_KEY from the tunnel flags bitmap
> when using bpf_skb_set_tunnel_key by providing a BPF_F_NO_TUNNEL_KEY
> flag. On egress, the resulting tunnel header will not contain a tunnel
> key if the protocol and implementation supports it.

> At the moment bpf_tunnel_key wants a user to specify a numeric tunnel
> key. This will wrap the inner packet into a tunnel header with the key
> bit and value set accordingly. This is problematic when using a tunnel
> protocol that supports optional tunnel keys and a receiving tunnel
> device that is not expecting packets with the key bit set. The receiver
> won't decapsulate and drop the packet.

> RFC 2890 and RFC 2784 GRE tunnels are examples where this flag is
> useful. It allows for generating packets, that can be decapsulated by
> a GRE tunnel device not operating in collect metadata mode or not
> expecting the key bit set.

> Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>

Acked-by: Stanislav Fomichev <sdf@google.com>

> ---
>   include/uapi/linux/bpf.h       | 4 ++++
>   net/core/filter.c              | 5 ++++-
>   tools/include/uapi/linux/bpf.h | 4 ++++
>   3 files changed, 12 insertions(+), 1 deletion(-)

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 464ca3f01fe7..bc1a3d232ae4 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2001,6 +2001,9 @@ union bpf_attr {
>    * 			sending the packet. This flag was added for GRE
>    * 			encapsulation, but might be used with other protocols
>    * 			as well in the future.
> + * 		**BPF_F_NO_TUNNEL_KEY**
> + * 			Add a flag to tunnel metadata indicating that no tunnel
> + * 			key should be set in the resulting tunnel header.
>    *
>    * 		Here is a typical usage on the transmit path:
>    *
> @@ -5764,6 +5767,7 @@ enum {
>   	BPF_F_ZERO_CSUM_TX		= (1ULL << 1),
>   	BPF_F_DONT_FRAGMENT		= (1ULL << 2),
>   	BPF_F_SEQ_NUMBER		= (1ULL << 3),
> +	BPF_F_NO_TUNNEL_KEY		= (1ULL << 4),
>   };

>   /* BPF_FUNC_skb_get_tunnel_key flags. */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 929358677183..c746e4d77214 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4615,7 +4615,8 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff  
> *, skb,
>   	struct ip_tunnel_info *info;

>   	if (unlikely(flags & ~(BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX |
> -			       BPF_F_DONT_FRAGMENT | BPF_F_SEQ_NUMBER)))
> +			       BPF_F_DONT_FRAGMENT | BPF_F_SEQ_NUMBER |
> +			       BPF_F_NO_TUNNEL_KEY)))
>   		return -EINVAL;
>   	if (unlikely(size != sizeof(struct bpf_tunnel_key))) {
>   		switch (size) {
> @@ -4653,6 +4654,8 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff  
> *, skb,
>   		info->key.tun_flags &= ~TUNNEL_CSUM;
>   	if (flags & BPF_F_SEQ_NUMBER)
>   		info->key.tun_flags |= TUNNEL_SEQ;
> +	if (flags & BPF_F_NO_TUNNEL_KEY)
> +		info->key.tun_flags &= ~TUNNEL_KEY;

>   	info->key.tun_id = cpu_to_be64(from->tunnel_id);
>   	info->key.tos = from->tunnel_tos;
> diff --git a/tools/include/uapi/linux/bpf.h  
> b/tools/include/uapi/linux/bpf.h
> index 464ca3f01fe7..bc1a3d232ae4 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2001,6 +2001,9 @@ union bpf_attr {
>    * 			sending the packet. This flag was added for GRE
>    * 			encapsulation, but might be used with other protocols
>    * 			as well in the future.
> + * 		**BPF_F_NO_TUNNEL_KEY**
> + * 			Add a flag to tunnel metadata indicating that no tunnel
> + * 			key should be set in the resulting tunnel header.
>    *
>    * 		Here is a typical usage on the transmit path:
>    *
> @@ -5764,6 +5767,7 @@ enum {
>   	BPF_F_ZERO_CSUM_TX		= (1ULL << 1),
>   	BPF_F_DONT_FRAGMENT		= (1ULL << 2),
>   	BPF_F_SEQ_NUMBER		= (1ULL << 3),
> +	BPF_F_NO_TUNNEL_KEY		= (1ULL << 4),
>   };

>   /* BPF_FUNC_skb_get_tunnel_key flags. */
> --
> 2.37.4


^ permalink raw reply

* Re: [PATCH v7] igb: Assign random MAC address instead of fail in case of invalid one
From: Jacob Keller @ 2022-12-19 18:25 UTC (permalink / raw)
  To: Alexander Duyck, Heiner Kallweit
  Cc: Jakub Kicinski, Leon Romanovsky, Lixue Liang, anthony.l.nguyen,
	linux-kernel, jesse.brandeburg, davem, edumazet, pabeni, netdev,
	lianglixue
In-Reply-To: <CAKgT0UdZUmFD8qYdF6k47ZmRgB0jS-graOYgB5Se+y+4fKqW8w@mail.gmail.com>

On 12/14/2022 3:17 PM, Alexander Duyck wrote:
> On Wed, Dec 14, 2022 at 1:43 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> As far as the discussion for module parameter vs something else. The
> problem with the driver is that it is monolithic so it isn't as if
> there is a devlink interface to configure a per-network parameter
> before the network portion is loaded. The module parameter is a
> compromise that should only be used to enable the workaround so that
> the driver can be loaded so that the EEPROM can then be edited. If
> anything, tying the EEPROM to ethtool is the issue. If somebody wants
> to look at providing an option to edit the EEPROM via devlink that
> would solve the issue as then the driver could expose the devlink
> interface to edit the EEPROM without having to allocate and register a
> netdev.

Right. Since igb only has netdev as the way to change the MAC or edit
the EEPROM, there is no way to fix this if the driver doesn't load.

Ideally the driver would instead still be able to register and expose an
interface (devlink?) that does not depend on the MAC, and when the MAC
is invalid it just would not register a netdevice. Another option might
me some method of a netdev to say "I don't have a valid MAC, so don't
allow me to do traffic"?

A global sysctl policy as discussed elsewhere in the thread also seems
reasonable to me, given that modifying the driver to expose devlink is a
lot more work. Of course we'd also need to check and get other drivers
to use the same global policy as well.

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH net 1/3] Documentation: devlink: add missing toc entry for etas_es58x devlink doc
From: Jacob Keller @ 2022-12-19 18:14 UTC (permalink / raw)
  To: Marc Kleine-Budde, netdev
  Cc: davem, kuba, linux-can, kernel, Vincent Mailhol, Stephen Rothwell,
	kernel test robot
In-Reply-To: <20221219155210.1143439-2-mkl@pengutronix.de>



On 12/19/2022 7:52 AM, Marc Kleine-Budde wrote:
> From: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
> 
> toc entry is missing for etas_es58x devlink doc and triggers this warning:
> 
>   Documentation/networking/devlink/etas_es58x.rst: WARNING: document isn't included in any toctree
> 
> Add the missing toc entry.
> 
> Fixes: 9f63f96aac92 ("Documentation: devlink: add devlink documentation for the etas_es58x driver")
> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
> Link: https://lore.kernel.org/all/20221213051136.721887-1-mailhol.vincent@wanadoo.fr
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
> ---

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

>  Documentation/networking/devlink/index.rst | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
> index 4b653d040627..fee4d3968309 100644
> --- a/Documentation/networking/devlink/index.rst
> +++ b/Documentation/networking/devlink/index.rst
> @@ -50,6 +50,7 @@ parameters, info versions, and other features it supports.
>     :maxdepth: 1
>  
>     bnxt
> +   etas_es58x
>     hns3
>     ionic
>     ice
> 
> base-commit: 2856a62762c8409e360d4fd452194c8e57ba1058

^ permalink raw reply

* Re: [RFC net-next 07/10] netdevsim: rename a label
From: Jacob Keller @ 2022-12-19 18:01 UTC (permalink / raw)
  To: Jakub Kicinski, jiri, leon; +Cc: netdev
In-Reply-To: <20221217011953.152487-8-kuba@kernel.org>



On 12/16/2022 5:19 PM, Jakub Kicinski wrote:
> err_dl_unregister should unregister the devlink instance.
> Looks like renaming it was missed in one of the reshufflings.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Makes sense.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> ---
>  drivers/net/netdevsim/dev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> index b962fc8e1397..d25f6e86d901 100644
> --- a/drivers/net/netdevsim/dev.c
> +++ b/drivers/net/netdevsim/dev.c
> @@ -1563,7 +1563,7 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>  	err = devlink_params_register(devlink, nsim_devlink_params,
>  				      ARRAY_SIZE(nsim_devlink_params));
>  	if (err)
> -		goto err_dl_unregister;
> +		goto err_resource_unregister;
>  	nsim_devlink_set_params_init_values(nsim_dev, devlink);
>  
>  	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
> @@ -1629,7 +1629,7 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>  err_params_unregister:
>  	devlink_params_unregister(devlink, nsim_devlink_params,
>  				  ARRAY_SIZE(nsim_devlink_params));
> -err_dl_unregister:
> +err_resource_unregister:
>  	devl_resources_unregister(devlink);
>  err_vfc_free:
>  	kfree(nsim_dev->vfconfigs);

^ permalink raw reply

* Re: [RFC net-next 05/10] devlink: remove the registration guarantee of references
From: Jacob Keller @ 2022-12-19 17:56 UTC (permalink / raw)
  To: Jakub Kicinski, jiri, leon; +Cc: netdev
In-Reply-To: <20221217011953.152487-6-kuba@kernel.org>



On 12/16/2022 5:19 PM, Jakub Kicinski wrote:
> The objective of exposing the devlink instance locks to
> drivers was to let them use these locks to prevent user space
> from accessing the device before it's fully initialized.
> This is difficult because devlink_unregister() waits for all
> references to be released, meaning that devlink_unregister()
> can't itself be called under the instance lock.
> 

Sure.

> To avoid this issue devlink_register() was moved after subobject
> registration a while ago. Unfortunately the netdev paths get
> a hold of the devlink instances _before_ they are registered.
> Ideally netdev should wait for devlink init to finish (synchronizing
> on the instance lock). This can't work because we don't know if the
> instance will _ever_ be registered (in case of failures it may not).
> The other option of returning an error until devlink_register()
> is called is unappealing (user space would get a notification
> netdev exist but would have to wait arbitrary amount of time
> before accessing some of its attributes).
> 

Nice summary of the problems and options that we have tried already.

I think its also important as this can allow sub objects to be
registered after the devlink instance?

> Weaken the guarantees of the devlink references.
> 
> Holding a reference will now only guarantee that the memory
> of the object is around. Another way of looking at it is that
> the reference now protects the object not its "registered" status.
> Use devlink instance lock to synchronize unregistration.
> 

Right, this makes sense.

> This implies that releasing of the "main" reference of the devlink
> instance moves from devlink_unregister() to devlink_free().
> 

This makes sense and I think aligns more with how most references work
in practice. Good.

> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Code change seems straight forward enough. I had a minor question, but:

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> ---
>  include/net/devlink.h       |  2 ++
>  net/devlink/core.c          | 64 ++++++++++++++++---------------------
>  net/devlink/devl_internal.h |  2 --
>  3 files changed, 30 insertions(+), 38 deletions(-)
> 
> diff --git a/include/net/devlink.h b/include/net/devlink.h
> index 36e013d3aa52..cc910612b3f4 100644
> --- a/include/net/devlink.h
> +++ b/include/net/devlink.h
> @@ -1648,6 +1648,8 @@ static inline struct devlink *devlink_alloc(const struct devlink_ops *ops,
>  	return devlink_alloc_ns(ops, priv_size, &init_net, dev);
>  }
>  void devlink_set_features(struct devlink *devlink, u64 features);
> +int devl_register(struct devlink *devlink);
> +void devl_unregister(struct devlink *devlink);
>  void devlink_register(struct devlink *devlink);
>  void devlink_unregister(struct devlink *devlink);
>  void devlink_free(struct devlink *devlink);
> diff --git a/net/devlink/core.c b/net/devlink/core.c
> index 2abad8247597..413b92534ad6 100644
> --- a/net/devlink/core.c
> +++ b/net/devlink/core.c
> @@ -89,21 +89,10 @@ struct devlink *__must_check devlink_try_get(struct devlink *devlink)
>  	return NULL;
>  }
>  
> -static void __devlink_put_rcu(struct rcu_head *head)
> -{
> -	struct devlink *devlink = container_of(head, struct devlink, rcu);
> -
> -	complete(&devlink->comp);
> -}
> -
>  void devlink_put(struct devlink *devlink)
>  {
>  	if (refcount_dec_and_test(&devlink->refcount))
> -		/* Make sure unregister operation that may await the completion
> -		 * is unblocked only after all users are after the end of
> -		 * RCU grace period.
> -		 */
> -		call_rcu(&devlink->rcu, __devlink_put_rcu);
> +		kfree_rcu(devlink, rcu);
>  }
>  
>  struct devlink *devlinks_xa_find_get(struct net *net, unsigned long *indexp)
> @@ -116,13 +105,6 @@ struct devlink *devlinks_xa_find_get(struct net *net, unsigned long *indexp)
>  	if (!devlink)
>  		goto unlock;
>  
> -	/* In case devlink_unregister() was already called and "unregistering"
> -	 * mark was set, do not allow to get a devlink reference here.
> -	 * This prevents live-lock of devlink_unregister() wait for completion.
> -	 */
> -	if (xa_get_mark(&devlinks, *indexp, DEVLINK_UNREGISTERING))
> -		goto next;
> -
>  	if (!devlink_try_get(devlink))
>  		goto next;
>  	if (!net_eq(devlink_net(devlink), net)) {
> @@ -158,37 +140,48 @@ void devlink_set_features(struct devlink *devlink, u64 features)
>  EXPORT_SYMBOL_GPL(devlink_set_features);
>  
>  /**
> - *	devlink_register - Register devlink instance
> - *
> - *	@devlink: devlink
> + * devl_register - Register devlink instance
> + * @devlink: devlink
>   */
> -void devlink_register(struct devlink *devlink)
> +int devl_register(struct devlink *devlink)
>  {
>  	ASSERT_DEVLINK_NOT_REGISTERED(devlink);
> -	/* Make sure that we are in .probe() routine */
> +	devl_assert_locked(devlink);
>  
>  	xa_set_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
>  	devlink_notify_register(devlink);
> +
> +	return 0;

Any particular reason to change this to int when it doesn't have a
failure case yet? Future patches I assume? You don't check the
devl_register return value.

> +}
> +EXPORT_SYMBOL_GPL(devl_register);
> +
> +void devlink_register(struct devlink *devlink)
> +{
> +	devl_lock(devlink);
> +	devl_register(devlink);
> +	devl_unlock(devlink);
>  }
>  EXPORT_SYMBOL_GPL(devlink_register);
>  
>  /**
> - *	devlink_unregister - Unregister devlink instance
> - *
> - *	@devlink: devlink
> + * devl_unregister - Unregister devlink instance
> + * @devlink: devlink
>   */
> -void devlink_unregister(struct devlink *devlink)
> +void devl_unregister(struct devlink *devlink)
>  {
>  	ASSERT_DEVLINK_REGISTERED(devlink);
> -	/* Make sure that we are in .remove() routine */
> -
> -	xa_set_mark(&devlinks, devlink->index, DEVLINK_UNREGISTERING);
> -	devlink_put(devlink);
> -	wait_for_completion(&devlink->comp);
> +	devl_assert_locked(devlink);
>  
>  	devlink_notify_unregister(devlink);
>  	xa_clear_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
> -	xa_clear_mark(&devlinks, devlink->index, DEVLINK_UNREGISTERING);
> +}
> +EXPORT_SYMBOL_GPL(devl_unregister);
> +
> +void devlink_unregister(struct devlink *devlink)
> +{
> +	devl_lock(devlink);
> +	devl_unregister(devlink);
> +	devl_unlock(devlink);
>  }
>  EXPORT_SYMBOL_GPL(devlink_unregister);
>  
> @@ -252,7 +245,6 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
>  	mutex_init(&devlink->reporters_lock);
>  	mutex_init(&devlink->linecards_lock);
>  	refcount_set(&devlink->refcount, 1);
> -	init_completion(&devlink->comp);
>  
>  	return devlink;
>  
> @@ -298,7 +290,7 @@ void devlink_free(struct devlink *devlink)
>  
>  	xa_erase(&devlinks, devlink->index);
>  
> -	kfree(devlink);
> +	devlink_put(devlink);
>  }
>  EXPORT_SYMBOL_GPL(devlink_free);
>  
> diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
> index c3977c69552a..7e77eebde3b9 100644
> --- a/net/devlink/devl_internal.h
> +++ b/net/devlink/devl_internal.h
> @@ -12,7 +12,6 @@
>  #include <net/net_namespace.h>
>  
>  #define DEVLINK_REGISTERED XA_MARK_1
> -#define DEVLINK_UNREGISTERING XA_MARK_2
>  
>  #define DEVLINK_RELOAD_STATS_ARRAY_SIZE \
>  	(__DEVLINK_RELOAD_LIMIT_MAX * __DEVLINK_RELOAD_ACTION_MAX)
> @@ -52,7 +51,6 @@ struct devlink {
>  	struct lock_class_key lock_key;
>  	u8 reload_failed:1;
>  	refcount_t refcount;
> -	struct completion comp;
>  	struct rcu_head rcu;
>  	struct notifier_block netdevice_nb;
>  	char priv[] __aligned(NETDEV_ALIGN);

^ permalink raw reply

* Re: [RFC net-next 04/10] devlink: always check if the devlink instance is registered
From: Jacob Keller @ 2022-12-19 17:48 UTC (permalink / raw)
  To: Jakub Kicinski, jiri, leon; +Cc: netdev
In-Reply-To: <20221217011953.152487-5-kuba@kernel.org>



On 12/16/2022 5:19 PM, Jakub Kicinski wrote:
> Always check under the instance lock whether the devlink instance
> is still / already registered.
> 

Ok. So now the reference ensures less about whats valid. It guarantees a
lock but doesn't ensure that the devlink remains registered unless you
acquire the lock and check that the devlink is alive under lock now?

> This is a no-op for the most part, as the unregistration path currently
> waits for all references. On the init path, however, we may temporarily
> open up a race with netdev code, if netdevs are registered before the
> devlink instance. This is temporary, the next change fixes it, and this
> commit has been split out for the ease of review.
> 

This means you're adding the problem here, but its fixed in next commit..?

> Note that in case of iterating over sub-objects which have their
> own lock (regions and line cards) we assume an implicit dependency
> between those objects existing and devlink unregistration.
> 

That seems reasonable.

> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/devlink.h |  1 +
>  net/devlink/basic.c   | 35 +++++++++++++++++++++++++++++------
>  net/devlink/core.c    | 25 +++++++++++++++++++++----
>  net/devlink/netlink.c | 10 ++++++++--
>  4 files changed, 59 insertions(+), 12 deletions(-)
> 
> diff --git a/include/net/devlink.h b/include/net/devlink.h
> index 6a2e4f21779f..36e013d3aa52 100644
> --- a/include/net/devlink.h
> +++ b/include/net/devlink.h
> @@ -1626,6 +1626,7 @@ struct device *devlink_to_dev(const struct devlink *devlink);
>  void devl_lock(struct devlink *devlink);
>  int devl_trylock(struct devlink *devlink);
>  void devl_unlock(struct devlink *devlink);
> +bool devl_is_alive(struct devlink *devlink);
>  void devl_assert_locked(struct devlink *devlink);
>  bool devl_lock_is_held(struct devlink *devlink);
>  
> diff --git a/net/devlink/basic.c b/net/devlink/basic.c
> index 5f33d74eef83..6b18e70a39fd 100644
> --- a/net/devlink/basic.c
> +++ b/net/devlink/basic.c
> @@ -2130,6 +2130,9 @@ static int devlink_nl_cmd_linecard_get_dumpit(struct sk_buff *msg,
>  		int idx = 0;
>  
>  		mutex_lock(&devlink->linecards_lock);
> +		if (!devl_is_alive(devlink))
> +			goto next_devlink;
> +
>  		list_for_each_entry(linecard, &devlink->linecard_list, list) {
>  			if (idx < dump->idx) {
>  				idx++;
> @@ -2151,6 +2154,7 @@ static int devlink_nl_cmd_linecard_get_dumpit(struct sk_buff *msg,
>  			}
>  			idx++;
>  		}
> +next_devlink:
>  		mutex_unlock(&devlink->linecards_lock);
>  		devlink_put(devlink);
>  	}
> @@ -7809,6 +7813,12 @@ devlink_nl_cmd_health_reporter_get_dumpit(struct sk_buff *msg,
>  		int idx = 0;
>  
>  		mutex_lock(&devlink->reporters_lock);
> +		if (!devl_is_alive(devlink)) {
> +			mutex_unlock(&devlink->reporters_lock);
> +			devlink_put(devlink);
> +			continue;
> +		}
> +
>  		list_for_each_entry(reporter, &devlink->reporter_list,
>  				    list) {
>  			if (idx < dump->idx) {
> @@ -7830,6 +7840,9 @@ devlink_nl_cmd_health_reporter_get_dumpit(struct sk_buff *msg,
>  		mutex_unlock(&devlink->reporters_lock);
>  
>  		devl_lock(devlink);
> +		if (!devl_is_alive(devlink))
> +			goto next_devlink;
> +
>  		xa_for_each(&devlink->ports, port_index, port) {
>  			mutex_lock(&port->reporters_lock);
>  			list_for_each_entry(reporter, &port->reporter_list, list) {
> @@ -7853,6 +7866,7 @@ devlink_nl_cmd_health_reporter_get_dumpit(struct sk_buff *msg,
>  			}
>  			mutex_unlock(&port->reporters_lock);
>  		}
> +next_devlink:
>  		devl_unlock(devlink);
>  		devlink_put(devlink);
>  	}
> @@ -12218,7 +12232,8 @@ void devlink_compat_running_version(struct devlink *devlink,
>  		return;
>  
>  	devl_lock(devlink);
> -	__devlink_compat_running_version(devlink, buf, len);
> +	if (devl_is_alive(devlink))
> +		__devlink_compat_running_version(devlink, buf, len);
>  	devl_unlock(devlink);
>  }
>  
> @@ -12227,20 +12242,28 @@ int devlink_compat_flash_update(struct devlink *devlink, const char *file_name)
>  	struct devlink_flash_update_params params = {};
>  	int ret;
>  
> -	if (!devlink->ops->flash_update)
> -		return -EOPNOTSUPP;
> +	devl_lock(devlink);
> +	if (!devl_is_alive(devlink)) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
> +	if (!devlink->ops->flash_update) {
> +		ret = -EOPNOTSUPP;
> +		goto out_unlock;
> +	}
>  
>  	ret = request_firmware(&params.fw, file_name, devlink->dev);
>  	if (ret)
> -		return ret;
> +		goto out_unlock;
>  
> -	devl_lock(devlink);
>  	devlink_flash_update_begin_notify(devlink);
>  	ret = devlink->ops->flash_update(devlink, &params, NULL);
>  	devlink_flash_update_end_notify(devlink);
> -	devl_unlock(devlink);
>  
>  	release_firmware(params.fw);
> +out_unlock:
> +	devl_unlock(devlink);
>  
>  	return ret;
>  }
> diff --git a/net/devlink/core.c b/net/devlink/core.c
> index d3b8336946fd..2abad8247597 100644
> --- a/net/devlink/core.c
> +++ b/net/devlink/core.c
> @@ -67,6 +67,21 @@ void devl_unlock(struct devlink *devlink)
>  }
>  EXPORT_SYMBOL_GPL(devl_unlock);
>  
> +bool devl_is_alive(struct devlink *devlink)
> +{
> +	return xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
> +}
> +EXPORT_SYMBOL_GPL(devl_is_alive);
> +
> +/**
> + * devlink_try_get() - try to obtain a reference on a devlink instance
> + * @devlink: instance to reference
> + *
> + * Obtain a reference on a devlink instance. A reference on a devlink instance
> + * only implies that it's safe to take the instance lock. It does not imply
> + * that the instance is registered, use devl_is_alive() after taking
> + * the instance lock to check registration status.
> + */
>  struct devlink *__must_check devlink_try_get(struct devlink *devlink)
>  {
>  	if (refcount_inc_not_zero(&devlink->refcount))
> @@ -300,10 +315,12 @@ static void __net_exit devlink_pernet_pre_exit(struct net *net)
>  	devlinks_xa_for_each_registered_get(net, index, devlink) {
>  		WARN_ON(!(devlink->features & DEVLINK_F_RELOAD));
>  		devl_lock(devlink);
> -		err = devlink_reload(devlink, &init_net,
> -				     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
> -				     DEVLINK_RELOAD_LIMIT_UNSPEC,
> -				     &actions_performed, NULL);
> +		err = 0;
> +		if (devl_is_alive(devlink))
> +			err = devlink_reload(devlink, &init_net,
> +					     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
> +					     DEVLINK_RELOAD_LIMIT_UNSPEC,
> +					     &actions_performed, NULL);
>  		devl_unlock(devlink);
>  		devlink_put(devlink);
>  
> diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
> index b38df704be1c..773efaabb6ad 100644
> --- a/net/devlink/netlink.c
> +++ b/net/devlink/netlink.c
> @@ -98,7 +98,8 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
>  
>  	devlinks_xa_for_each_registered_get(net, index, devlink) {
>  		devl_lock(devlink);
> -		if (strcmp(devlink->dev->bus->name, busname) == 0 &&
> +		if (devl_is_alive(devlink) &&
> +		    strcmp(devlink->dev->bus->name, busname) == 0 &&
>  		    strcmp(dev_name(devlink->dev), devname) == 0)
>  			return devlink;
>  		devl_unlock(devlink);
> @@ -210,7 +211,12 @@ int devlink_instance_iter_dump(struct sk_buff *msg, struct netlink_callback *cb)
>  
>  	devlink_dump_for_each_instance_get(msg, dump, devlink) {
>  		devl_lock(devlink);
> -		err = cmd->dump_one(msg, devlink, cb);
> +
> +		if (devl_is_alive(devlink))
> +			err = cmd->dump_one(msg, devlink, cb);
> +		else
> +			err = 0;
> +
>  		devl_unlock(devlink);
>  		devlink_put(devlink);
>  

^ permalink raw reply

* Re: [RFC net-next 00/10] devlink: remove the wait-for-references on unregister
From: Jacob Keller @ 2022-12-19 17:38 UTC (permalink / raw)
  To: Jakub Kicinski, jiri, leon; +Cc: netdev
In-Reply-To: <20221217011953.152487-1-kuba@kernel.org>



On 12/16/2022 5:19 PM, Jakub Kicinski wrote:
> This set is on top of the previous RFC.
> 
> Move the registration and unregistration of the devlink instances
> under their instance locks. Don't perform the netdev-style wait
> for all references when unregistering the instance.
> 

Could you explain the reasoning/benefits here? I'm sure some of this is
explained in each commit message as well but it would help to understand
the overall series.

> Jakub Kicinski (10):
>   devlink: bump the instance index directly when iterating
>   devlink: update the code in netns move to latest helpers
>   devlink: protect devlink->dev by the instance lock
>   devlink: always check if the devlink instance is registered
>   devlink: remove the registration guarantee of references
>   devlink: don't require setting features before registration
>   netdevsim: rename a label
>   netdevsim: move devlink registration under the instance lock
>   devlink: allow registering parameters after the instance
>   netdevsim: register devlink instance before sub-objects
> 
>  drivers/net/netdevsim/dev.c |  15 +++--
>  include/net/devlink.h       |   3 +
>  net/devlink/basic.c         |  64 ++++++++++++------
>  net/devlink/core.c          | 127 +++++++++++++++++-------------------
>  net/devlink/devl_internal.h |  20 ++----
>  net/devlink/netlink.c       |  19 ++++--
>  6 files changed, 136 insertions(+), 112 deletions(-)
> 

^ permalink raw reply

* [PATCH v2] octeontx2_pf: Select NET_DEVLINK when enabling OCTEONTX2_PF
From: Paul Gazzillo @ 2022-12-19 17:19 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Randy Dunlap, Yang Yingliang, Zheng Bin, Suman Ghosh,
	Sunil Goutham, Subbaraya Sundeep, netdev, linux-kernel
  Cc: Paul Gazzillo

When using COMPILE_TEST, the driver controlled by OCTEONTX2_PF does
not select NET_DEVLINK while the related OCTEONTX2_AF driver does.
This means that when OCTEONTX2_PF is enabled from a default
configuration, linker errors will occur due to undefined references to
code controlled by NET_DEVLINK.

1. make.cross ARCH=x86_64 defconfig
2. make.cross ARCH=x86_64 menuconfig
3. Enable COMPILE_TEST
   General setup  --->
     Compile also drivers which will not load
4. Enable OCTEONTX2_PF
   Device Drivers  --->
     Network device support  --->
       Ethernet driver support  --->
         Marvell OcteonTX2 NIC Physical Function driver
5. Exit and save configuration.  NET_DEVLINK will still be disabled.
6. make.cross ARCH=x86_64 several linker errors, for example,
   ld: drivers/net/ethernet/marvell/octeontx2/nic/otx2_devlink.o:
     in function `otx2_register_dl':
   otx2_devlink.c:(.text+0x142): undefined reference to `devlink_alloc_ns'

This fix adds "select NET_DEVLINK" link to OCTEONTX2_PF's Kconfig
specification to match OCTEONTX2_AF.

Fixes: 2da489432747 ("octeontx2-pf: devlink params support to set mcam entry count")
Signed-off-by: Paul Gazzillo <paul@pgazz.com>
---
v1 -> v2: Added the fixes tag

 drivers/net/ethernet/marvell/octeontx2/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/Kconfig b/drivers/net/ethernet/marvell/octeontx2/Kconfig
index 3f982ccf2c85..639893d87055 100644
--- a/drivers/net/ethernet/marvell/octeontx2/Kconfig
+++ b/drivers/net/ethernet/marvell/octeontx2/Kconfig
@@ -31,6 +31,7 @@ config NDC_DIS_DYNAMIC_CACHING
 config OCTEONTX2_PF
 	tristate "Marvell OcteonTX2 NIC Physical Function driver"
 	select OCTEONTX2_MBOX
+	select NET_DEVLINK
 	depends on (64BIT && COMPILE_TEST) || ARM64
 	depends on PCI
 	depends on PTP_1588_CLOCK_OPTIONAL
-- 
2.25.1


^ permalink raw reply related

* Re: [syzbot] kernel BUG in rxrpc_put_peer
From: syzbot @ 2022-12-19 17:17 UTC (permalink / raw)
  To: davem, dhowells, edumazet, kuba, linux-afs, linux-kernel,
	marc.dionne, netdev, pabeni, syzkaller-bugs
In-Reply-To: <1535853.1671468284@warthog.procyon.org.uk>

Hello,

syzbot has tested the proposed patch but the reproducer is still triggering an issue:
INFO: rcu detected stall in corrupted

rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P5792 } 2657 jiffies s: 2825 root: 0x0/T
rcu: blocking rcu_node structures (internal RCU debug):

Tested on:

commit:         6529d700 rxrpc: Move client call connection to the I/O..
git tree:       https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/
console output: https://syzkaller.appspot.com/x/log.txt?x=13f3b420480000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b0e91ad4b5f69c47
dashboard link: https://syzkaller.appspot.com/bug?extid=c22650d2844392afdcfd
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

Note: no patches were applied.

^ permalink raw reply

* [syzbot] WARNING in put_pmu_ctx
From: syzbot @ 2022-12-19 17:15 UTC (permalink / raw)
  To: acme, alexander.shishkin, bpf, jolsa, linux-kernel,
	linux-perf-users, mark.rutland, mingo, namhyung, netdev, peterz,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    13e3c7793e2f Merge tag 'for-netdev' of https://git.kernel...
git tree:       bpf
console output: https://syzkaller.appspot.com/x/log.txt?x=12eda6e7880000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b0e91ad4b5f69c47
dashboard link: https://syzkaller.appspot.com/bug?extid=697196bc0265049822bd
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=114a2127880000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/373a99daa295/disk-13e3c779.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/7fa71ed0fe17/vmlinux-13e3c779.xz
kernel image: https://storage.googleapis.com/syzbot-assets/2842ad5c698b/bzImage-13e3c779.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+697196bc0265049822bd@syzkaller.appspotmail.com

------------[ cut here ]------------
WARNING: CPU: 1 PID: 22650 at kernel/events/core.c:4920 put_pmu_ctx kernel/events/core.c:4920 [inline]
WARNING: CPU: 1 PID: 22650 at kernel/events/core.c:4920 put_pmu_ctx+0x2a5/0x390 kernel/events/core.c:4893
Modules linked in:
CPU: 1 PID: 22650 Comm: syz-executor.4 Not tainted 6.1.0-syzkaller-09661-g13e3c7793e2f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
RIP: 0010:put_pmu_ctx kernel/events/core.c:4920 [inline]
RIP: 0010:put_pmu_ctx+0x2a5/0x390 kernel/events/core.c:4893
Code: dd ff e8 2e 0f dd ff 48 8d 7b 50 48 c7 c6 a0 f8 a2 81 e8 3e c8 c7 ff eb d6 e8 17 0f dd ff 0f 0b e9 64 ff ff ff e8 0b 0f dd ff <0f> 0b eb 88 e8 c2 bc 2a 00 eb a5 e8 fb 0e dd ff 0f 0b e9 e4 fd ff
RSP: 0018:ffffc90003f47c68 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880b9841978 RCX: 0000000000000000
RDX: ffff88801fc6ba80 RSI: ffffffff81a3a405 RDI: 0000000000000001
RBP: ffff8880b98419a8 R08: 0000000000000001 R09: 0000000000000001
R10: ffffed1017306b78 R11: 0000000000000000 R12: ffff8880b9835c90
R13: ffff8880b9835bc0 R14: 0000000000000293 R15: ffff8880b9841980
FS:  00007f1cdf1fe700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f78cc3157e2 CR3: 000000001ccc0000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 _free_event+0x3c5/0x13d0 kernel/events/core.c:5196
 put_event kernel/events/core.c:5283 [inline]
 perf_event_release_kernel+0x6ad/0x8f0 kernel/events/core.c:5395
 perf_release+0x37/0x50 kernel/events/core.c:5405
 __fput+0x27c/0xa90 fs/file_table.c:320
 task_work_run+0x16f/0x270 kernel/task_work.c:179
 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x23c/0x250 kernel/entry/common.c:203
 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
 syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:296
 do_syscall_64+0x46/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f1cdfe8c0d9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f1cdf1fe168 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f1cdffabf80 RCX: 00007f1cdfe8c0d9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007f1cdfee7ae9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd2718bc2f R14: 00007f1cdf1fe300 R15: 0000000000022000
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH v7 net-next 0/5] add PLCA RS support and onsemi NCN26000
From: Jakub Kicinski @ 2022-12-19 17:15 UTC (permalink / raw)
  To: Piergiorgio Beruto
  Cc: Andrew Lunn, Heiner Kallweit, Russell King, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel, netdev, Oleksij Rempel
In-Reply-To: <Y52Vp+xMSUS2CgJe@gvm01>

On Sat, 17 Dec 2022 11:10:47 +0100 Piergiorgio Beruto wrote:
> > # Form letter - net-next is closed
> > 
> > We have already submitted the networking pull request to Linus
> > for v6.2 and therefore net-next is closed for new drivers, features,
> > code refactoring and optimizations. We are currently accepting
> > bug fixes only.
> > 
> > Please repost when net-next reopens after Jan 2nd.
> > 
> > RFC patches sent for review only are obviously welcome at any time.  
>
> Hello Jakub, sorry for asking dumb questions, but what exactly "RFC"
> means? I understand you cannot accept new submissions at this time, but
> does this means the patchset I just submitted can still be reviewd so
> they are ready for integration on Jan 2nd?

Yes, exactly. You can keep posting new versions, to get reviews and get
the code ready for merging once net-next re-opens. Once net-next
re-opens you'll need to repost (it'd be too much work for us to keep
track of all "pending" work during the shutdown).

By the RFC I mean - change the [PATCH net-next] to [RFC net-next] or
[PATCH RFC net-next] if you post during shutdown, so that we know that
you know that the patches can't be merged _right now_.

^ permalink raw reply

* [PATCH] octeontx2_pf: Select NET_DEVLINK when enabling OCTEONTX2_PF
From: Paul Gazzillo @ 2022-12-19 17:11 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Randy Dunlap, Zheng Bin, Yang Yingliang, Suman Ghosh,
	Subbaraya Sundeep, Sunil Goutham, netdev, linux-kernel
  Cc: Paul Gazzillo

When using COMPILE_TEST, the driver controlled by OCTEONTX2_PF does
not select NET_DEVLINK while the related OCTEONTX2_AF driver does.
This means that when OCTEONTX2_PF is enabled from a default
configuration, linker errors will occur due to undefined references to
code controlled by NET_DEVLINK.

1. make.cross ARCH=x86_64 defconfig
2. make.cross ARCH=x86_64 menuconfig
3. Enable COMPILE_TEST
   General setup  --->
     Compile also drivers which will not load
4. Enable OCTEONTX2_PF
   Device Drivers  --->
     Network device support  --->
       Ethernet driver support  --->
         Marvell OcteonTX2 NIC Physical Function driver
5. Exit and save configuration.  NET_DEVLINK will still be disabled.
6. make.cross ARCH=x86_64 several linker errors, for example,
   ld: drivers/net/ethernet/marvell/octeontx2/nic/otx2_devlink.o:
     in function `otx2_register_dl':
   otx2_devlink.c:(.text+0x142): undefined reference to `devlink_alloc_ns'

This fix adds "select NET_DEVLINK" link to OCTEONTX2_PF's Kconfig
specification to match OCTEONTX2_AF.

Fixes: 2da489432747 ("octeontx2-pf: devlink params support to set mcam entry count")
Signed-off-by: Paul Gazzillo <paul@pgazz.com>
---
v1 -> v2: Added the fixes tag

 drivers/net/ethernet/marvell/octeontx2/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/Kconfig b/drivers/net/ethernet/marvell/octeontx2/Kconfig
index 3f982ccf2c85..639893d87055 100644
--- a/drivers/net/ethernet/marvell/octeontx2/Kconfig
+++ b/drivers/net/ethernet/marvell/octeontx2/Kconfig
@@ -31,6 +31,7 @@ config NDC_DIS_DYNAMIC_CACHING
 config OCTEONTX2_PF
 	tristate "Marvell OcteonTX2 NIC Physical Function driver"
 	select OCTEONTX2_MBOX
+	select NET_DEVLINK
 	depends on (64BIT && COMPILE_TEST) || ARM64
 	depends on PCI
 	depends on PTP_1588_CLOCK_OPTIONAL
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] octeontx2_pf: Select NET_DEVLINK when enabling OCTEONTX2_PF
From: Paul Gazzillo @ 2022-12-19 17:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Randy Dunlap,
	Zheng Bin, Yang Yingliang, Suman Ghosh, netdev, linux-kernel
In-Reply-To: <20221216212943.05813d44@kernel.org>

On 12/16/2022, Jakub Kicinski wrote:
> On Fri, 16 Dec 2022 16:54:48 -0500 Paul Gazzillo wrote:
> > This fix adds "select NET_DEVLINK" link to OCTEONTX2_PF's Kconfig
> > specification to match OCTEONTX2_AF.
> 
> Where was the use of devlink introduced in OCTEONTX2_PF?
> Could you provide a Fixes tag?

It looks like it was introduced in

  2da489432747 ("octeontx2-pf: devlink params support to set mcam entry count")

I will send a new version of the patch with a "Fixes:" tag now.

^ permalink raw reply

* [RFC PATCH net-next 5/5] net/smc: logic of cursors update in SMC-D loopback connections
From: Wen Gu @ 2022-12-19 17:07 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel
In-Reply-To: <1671469668-82691-1-git-send-email-guwen@linux.alibaba.com>

Since local sndbuf of SMC-D loopback connection shares the same
physical memory region with peer RMB, the logic of cursors update
needs to be adapted.

The main difference from original implementation is need to ensure
that the data copied to local sndbuf won't overwrite the unconsumed
data of peer.

So, for SMC-D loopback connections:

1. TX
        a. don't update fin_curs when send out cdc msg.
        b. fin_curs and sndbuf_space update will be deferred until
           receiving peer cons_curs update.

2. RX
        a. same as before. peer sndbuf is as large as local rmb,
           which guarantees that prod_curs will behind prep_curs.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_cdc.c      | 53 +++++++++++++++++++++++++++++++++++++++-----------
 net/smc/smc_loopback.c |  7 +++++++
 2 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 61f5ff7..586472a 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -253,17 +253,26 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
 		return rc;
 	smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
 	conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
-	/* Calculate transmitted data and increment free send buffer space */
-	diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
-			     &conn->tx_curs_sent);
-	/* increased by confirmed number of bytes */
-	smp_mb__before_atomic();
-	atomic_add(diff, &conn->sndbuf_space);
-	/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
-	smp_mb__after_atomic();
-	smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+	if (!conn->lgr->smcd->is_loopback) {
+		/* Note:
+		 * For smcd loopback device:
+		 *
+		 * Don't update the fin_curs and sndbuf_space here.
+		 * Update fin_curs when peer consumes the data in RMB.
+		 */
 
-	smc_tx_sndbuf_nonfull(smc);
+		/* Calculate transmitted data and increment free send buffer space */
+		diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+				     &conn->tx_curs_sent);
+		/* increased by confirmed number of bytes */
+		smp_mb__before_atomic();
+		atomic_add(diff, &conn->sndbuf_space);
+		/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+		smp_mb__after_atomic();
+		smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+
+		smc_tx_sndbuf_nonfull(smc);
+	}
 	return rc;
 }
 
@@ -321,7 +330,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 {
 	union smc_host_cursor cons_old, prod_old;
 	struct smc_connection *conn = &smc->conn;
-	int diff_cons, diff_prod;
+	int diff_cons, diff_prod, diff_tx;
 
 	smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn);
 	smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn);
@@ -337,6 +346,28 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
 		atomic_add(diff_cons, &conn->peer_rmbe_space);
 		/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
 		smp_mb__after_atomic();
+
+		/* For smcd loopback device:
+		 * Update of peer cons_curs indicates that
+		 * 1. peer rmbe space increases.
+		 * 2. local sndbuf space increases.
+		 *
+		 * So local sndbuf fin_curs should be equal to peer RMB cons_curs.
+		 */
+		if (conn->lgr->is_smcd &&
+		    conn->lgr->smcd->is_loopback) {
+			/* calculate peer rmb consumed data */
+			diff_tx = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+						&conn->local_rx_ctrl.cons);
+			/* increase local sndbuf space and fin_curs */
+			smp_mb__before_atomic();
+			atomic_add(diff_tx, &conn->sndbuf_space);
+			/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+			smp_mb__after_atomic();
+			smc_curs_copy(&conn->tx_curs_fin, &conn->local_rx_ctrl.cons, conn);
+
+			smc_tx_sndbuf_nonfull(smc);
+		}
 	}
 
 	diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 2c8540d..9b2bdf282 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -223,6 +223,13 @@ int lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
 	struct lo_dmb_node *rmb_node = NULL, *tmp_node;
 	struct lo_dev *ldev = smcd->priv;
 
+	if (!sf) {
+		/* no need to move data.
+		 * sndbuf is equal to peer rmb.
+		 */
+		return 0;
+	}
+
 	read_lock(&ldev->dmb_ht_lock);
 	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
 		if (tmp_node->token == dmb_tok) {
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH net-next 4/5] net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
From: Wen Gu @ 2022-12-19 17:07 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel
In-Reply-To: <1671469668-82691-1-git-send-email-guwen@linux.alibaba.com>

This patch aims to improve SMC-D loopback performance by avoiding
data copy from local sndbuf to peer RMB. The main idea is to let
local sndbuf and peer RMB share the same physical memory.

 +----------+                     +----------+
 | socket A |                     | socket B |
 +----------+                     +----------+
       |                               ^
       |         +---------+           |
  regard as      |         | ----------|
  local sndbuf   |  B's    |     regard as
       |         |  RMB    |     local RMB
       |-------> |         |
                 +---------+

For connections using smcd loopback device:

1. Only create and maintain local RMB.
        a. Create or reuse RMB when create connection;
        b. Free RMB when lgr free;

2. Attach local sndbuf to peer RMB.
        a. sndbuf_desc describes the same memory region as peer rmb_desc.
        b. sndbuf_desc is exclusive to specific connection and won't be
           added to lgr buffer pool for reuse.
        c. sndbuf is attached to peer RMB when receive remote token after
           CLC accept/confirm message.
        d. sndbuf is detached from peer RMB when connection is freed.

Therefore, the data copied from the userspace to local sndbuf directly
reaches the peer RMB.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/af_smc.c   | 23 +++++++++++++++++++-
 net/smc/smc_core.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_core.h |  2 ++
 3 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index b9884c8..c7de566 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1073,7 +1073,6 @@ static int smc_find_proposal_devices(struct smc_sock *smc,
 	 * The RFC patch hasn't resolved this, just simply always
 	 * chooses loopback device first, and fallback if loopback
 	 * communication is impossible.
-	 *
 	 */
 	/* check if there is an ism or loopback device available */
 	if (!(ini->smcd_version & SMC_V1) ||
@@ -1397,6 +1396,17 @@ static int smc_connect_ism(struct smc_sock *smc,
 	}
 
 	smc_conn_save_peer_info(smc, aclc);
+
+	/* special for smcd loopback
+	 * conns above smcd loopback dev only create their rmbs.
+	 * their sndbufs are 'maps' of peer rmbs.
+	 */
+	if (smc->conn.lgr->smcd->is_loopback) {
+		rc = smcd_buf_attach(&smc->conn);
+		if (rc)
+			goto connect_abort;
+		smc->sk.sk_sndbuf = 2 * (smc->conn.sndbuf_desc->len);
+	}
 	smc_close_init(smc);
 	smc_rx_init(smc);
 	smc_tx_init(smc);
@@ -2464,6 +2474,17 @@ static void smc_listen_work(struct work_struct *work)
 		mutex_unlock(&smc_server_lgr_pending);
 	}
 	smc_conn_save_peer_info(new_smc, cclc);
+
+	/* special for smcd loopback
+	 * conns above smcd loopback dev only create their rmbs.
+	 * their sndbufs are 'maps' of peer rmbs.
+	 */
+	if (ini->is_smcd && new_smc->conn.lgr->smcd->is_loopback) {
+		rc = smcd_buf_attach(&new_smc->conn);
+		if (rc)
+			goto out_decl;
+		new_smc->sk.sk_sndbuf = 2 * (new_smc->conn.sndbuf_desc->len);
+	}
 	smc_listen_out_connected(new_smc);
 	SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
 	goto out_free;
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index c305d8d..bf40ad3 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1171,6 +1171,10 @@ void smc_conn_free(struct smc_connection *conn)
 		if (!list_empty(&lgr->list))
 			smc_ism_unset_conn(conn);
 		tasklet_kill(&conn->rx_tsklet);
+
+		/* detach sndbuf from peer rmb */
+		if (lgr->smcd->is_loopback)
+			smcd_buf_detach(conn);
 	} else {
 		smc_cdc_wait_pend_tx_wr(conn);
 		if (current_work() != &conn->abort_work)
@@ -2423,6 +2427,14 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
 {
 	int rc;
 
+	if (is_smcd && smc->conn.lgr->smcd->is_loopback) {
+		/* Conns above smcd loopback device only create and maintain
+		 * their RMBs. The sndbufs will be attached to peer RMBs once
+		 * getting the tokens.
+		 */
+		return __smc_buf_create(smc, is_smcd, true);
+	}
+
 	/* create send buffer */
 	rc = __smc_buf_create(smc, is_smcd, false);
 	if (rc)
@@ -2439,6 +2451,56 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
 	return rc;
 }
 
+/* for smcd loopback conns, attach local sndbuf to peer RMB.
+ * The data copy to sndbuf is equal to data copy to peer RMB.
+ */
+int smcd_buf_attach(struct smc_connection *conn)
+{
+	struct smcd_dev *smcd = conn->lgr->smcd;
+	u64 peer_token = conn->peer_token;
+	struct smc_buf_desc *buf_desc;
+	int rc;
+
+	buf_desc = kzalloc(sizeof(*buf_desc), GFP_KERNEL);
+	if (!buf_desc)
+		return -ENOMEM;
+	rc = smc_ism_attach_dmb(smcd, peer_token, buf_desc);
+	if (rc) {
+		rc = SMC_CLC_DECL_ERR_RTOK;
+		goto free;
+	}
+
+	/* attach local sndbuf to peer RMB.
+	 * refer to local sndbuf is equal to refer to peer RMB.
+	 */
+	/* align with peer rmb */
+	buf_desc->cpu_addr = (u8 *)buf_desc->cpu_addr + sizeof(struct smcd_cdc_msg);
+	buf_desc->len -=  sizeof(struct smcd_cdc_msg);
+	conn->sndbuf_desc = buf_desc;
+	conn->sndbuf_desc->used = 1;
+	//smc->sk.sk_sndbuf = 2 * (smc->conn->sndbuf_desc->len);
+	atomic_set(&conn->sndbuf_space, conn->sndbuf_desc->len);
+	return 0;
+
+free:
+	kfree(buf_desc);
+	return rc;
+}
+
+void smcd_buf_detach(struct smc_connection *conn)
+{
+	struct smcd_dev *smcd = conn->lgr->smcd;
+	u64 peer_token = conn->peer_token;
+
+	if (!conn->sndbuf_desc)
+		return;
+
+	smc_ism_detach_dmb(smcd, peer_token);
+
+	kfree(conn->sndbuf_desc);
+	conn->sndbuf_desc = NULL;
+}
+
 static inline int smc_rmb_reserve_rtoken_idx(struct smc_link_group *lgr)
 {
 	int i;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 285f9bd..b51b020 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -518,6 +518,8 @@ void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid,
 void smc_smcd_terminate_all(struct smcd_dev *dev);
 void smc_smcr_terminate_all(struct smc_ib_device *smcibdev);
 int smc_buf_create(struct smc_sock *smc, bool is_smcd);
+int smcd_buf_attach(struct smc_connection *conn);
+void smcd_buf_detach(struct smc_connection *conn);
 int smc_uncompress_bufsize(u8 compressed);
 int smc_rmb_rtoken_handling(struct smc_connection *conn, struct smc_link *link,
 			    struct smc_clc_msg_accept_confirm *clc);
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH net-next 3/5] net/smc: add dmb attach and detach interface
From: Wen Gu @ 2022-12-19 17:07 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel
In-Reply-To: <1671469668-82691-1-git-send-email-guwen@linux.alibaba.com>

This patch extends smcd_ops, adding two more semantic for
SMC-D device:

- attach_dmb:

  Attach an already registered dmb to a specific buf_desc,
  so that we can refer to the dmb through this buf_desc.

- detach_dmb:

  Reverse operation of attach_dmb. detach the dmb from the
  buf_desc.

This interface extension is to prepare for the reduction
of data moving from sndbuf to RMB in SMC-D loopback device.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 include/net/smc.h      |  2 ++
 net/smc/smc_ism.c      | 36 ++++++++++++++++++++++++++
 net/smc/smc_ism.h      |  2 ++
 net/smc/smc_loopback.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h |  4 +++
 5 files changed, 113 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 7699f97..60a96f7 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -63,6 +63,8 @@ struct smcd_ops {
 				u32 vid);
 	int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
 	int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+	int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+	int (*detach_dmb)(struct smcd_dev *dev, u64 token);
 	int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
 	int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
 	int (*set_vlan_required)(struct smcd_dev *dev);
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 1d10435..2049388 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -202,6 +202,42 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
 	return rc;
 }
 
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+		       struct smc_buf_desc *dmb_desc)
+{
+	struct smcd_dmb dmb;
+	int rc = 0;
+
+	memset(&dmb, 0, sizeof(dmb));
+	dmb.dmb_tok = token;
+
+	/* only support loopback device now */
+	if (!dev->is_loopback)
+		return -EINVAL;
+	if (!dev->ops->attach_dmb)
+		return -EINVAL;
+
+	rc = dev->ops->attach_dmb(dev, &dmb);
+	if (!rc) {
+		dmb_desc->sba_idx = dmb.sba_idx;
+		dmb_desc->token = dmb.dmb_tok;
+		dmb_desc->cpu_addr = dmb.cpu_addr;
+		dmb_desc->dma_addr = dmb.dma_addr;
+		dmb_desc->len = dmb.dmb_len;
+	}
+	return rc;
+}
+
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token)
+{
+	if (!dev->is_loopback)
+		return -EINVAL;
+	if (!dev->ops->detach_dmb)
+		return -EINVAL;
+
+	return dev->ops->detach_dmb(dev, token);
+}
+
 static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
 				  struct sk_buff *skb,
 				  struct netlink_callback *cb)
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index d6b2db6..9022979 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -38,6 +38,8 @@ struct smc_ism_vlanid {			/* VLAN id set on ISM device */
 int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
 			 struct smc_buf_desc *dmb_desc);
 int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token, struct smc_buf_desc *dmb_desc);
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token);
 int smc_ism_signal_shutdown(struct smc_link_group *lgr);
 void smc_ism_get_system_eid(u8 **eid);
 u16 smc_ism_get_chid(struct smcd_dev *dev);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 803156e..2c8540d 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -75,6 +75,7 @@ int lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
 	}
 	dmb_node->len = dmb->dmb_len;
 	dmb_node->dma_addr = (dma_addr_t)dmb_node->cpu_addr;
+	refcount_set(&dmb_node->refcnt, 1);
 
 	/* TODO: token is random but not exclusive !
 	 * suppose to find token in dmb hask table, if has this token
@@ -85,6 +86,7 @@ int lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
 	write_lock(&ldev->dmb_ht_lock);
 	hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
 	write_unlock(&ldev->dmb_ht_lock);
+	atomic_inc(&ldev->dmb_cnt);
 
 	dmb->sba_idx = dmb_node->sba_idx;
 	dmb->dmb_tok = dmb_node->token;
@@ -122,9 +124,69 @@ int lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
 	write_unlock(&ldev->dmb_ht_lock);
 
 	clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+
+	/* wait for dmb refcnt equal to 0 */
+	if (!refcount_dec_and_test(&dmb_node->refcnt))
+		wait_event(ldev->dmbs_release, !refcount_read(&dmb_node->refcnt));
 	kfree(dmb_node->cpu_addr);
 	kfree(dmb_node);
 
+	if (atomic_dec_and_test(&ldev->dmb_cnt))
+		wake_up(&ldev->ldev_release);
+
+	return 0;
+}
+
+int lo_attach_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+	struct lo_dmb_node *dmb_node = NULL, *tmp_node;
+	struct lo_dev *ldev = smcd->priv;
+
+	/* find dmb_node according to dmb->dmb_tok */
+	read_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+		if (tmp_node->token == dmb->dmb_tok) {
+			dmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!dmb_node) {
+		read_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	read_unlock(&ldev->dmb_ht_lock);
+	refcount_inc(&dmb_node->refcnt);
+
+	/* provide dmb information */
+	dmb->sba_idx = dmb_node->sba_idx;
+	dmb->dmb_tok = dmb_node->token;
+	dmb->cpu_addr = dmb_node->cpu_addr;
+	dmb->dma_addr = dmb_node->dma_addr;
+	dmb->dmb_len = dmb_node->len;
+	return 0;
+}
+
+int lo_detach_dmb(struct smcd_dev *smcd, u64 token)
+{
+	struct lo_dmb_node *dmb_node = NULL, *tmp_node;
+	struct lo_dev *ldev = smcd->priv;
+
+	/* find dmb_node according to dmb->dmb_tok */
+	read_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, token) {
+		if (tmp_node->token == token) {
+			dmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!dmb_node) {
+		read_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	read_unlock(&ldev->dmb_ht_lock);
+
+	if (refcount_dec_and_test(&dmb_node->refcnt))
+		wake_up_all(&ldev->dmbs_release);
 	return 0;
 }
 
@@ -200,6 +262,8 @@ u16 lo_get_chid(struct smcd_dev *smcd)
 	.query_remote_gid = lo_query_rgid,
 	.register_dmb = lo_register_dmb,
 	.unregister_dmb = lo_unregister_dmb,
+	.attach_dmb = lo_attach_dmb,
+	.detach_dmb = lo_detach_dmb,
 	.add_vlan_id = lo_add_vlan_id,
 	.del_vlan_id = lo_del_vlan_id,
 	.set_vlan_required = lo_set_vlan_required,
@@ -225,6 +289,9 @@ static int lo_dev_init(struct lo_dev *ldev)
 	ldev->lgid = smcd->local_gid;
 	rwlock_init(&ldev->dmb_ht_lock);
 	hash_init(ldev->dmb_ht);
+	atomic_set(&ldev->dmb_cnt, 0);
+	init_waitqueue_head(&ldev->dmbs_release);
+	init_waitqueue_head(&ldev->ldev_release);
 
 	return smcd_register_dev(smcd);
 }
@@ -262,6 +329,8 @@ static int lo_dev_probe(void)
 static void lo_dev_exit(struct lo_dev *ldev)
 {
 	smcd_unregister_dev(ldev->smcd);
+	if (atomic_read(&ldev->dmb_cnt))
+		wait_event(ldev->ldev_release, !atomic_read(&ldev->dmb_cnt));
 }
 
 static void lo_dev_remove(void)
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index d7f7815..f4122be 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -32,6 +32,7 @@ struct lo_dmb_node {
 	u32 sba_idx;
 	void *cpu_addr;
 	dma_addr_t dma_addr;
+	refcount_t refcnt;
 };
 
 struct lo_dev {
@@ -41,6 +42,9 @@ struct lo_dev {
 	DECLARE_BITMAP(sba_idx_mask, LODEV_MAX_DMBS);
 	rwlock_t dmb_ht_lock;
 	DECLARE_HASHTABLE(dmb_ht, LODEV_MAX_DMBS_BUCKETS);
+	atomic_t dmb_cnt;
+	wait_queue_head_t dmbs_release;
+	wait_queue_head_t ldev_release;
 };
 
 struct lo_systemeid {
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH net-next 2/5] net/smc: choose loopback device in SMC-D communication
From: Wen Gu @ 2022-12-19 17:07 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel
In-Reply-To: <1671469668-82691-1-git-send-email-guwen@linux.alibaba.com>

This patch allows SMC-D to use loopback device.

But noted that the implementation here is quiet simple and informal.
Loopback device will always be firstly chosen, and fallback happens
if loopback communication is impossible.

It needs to be discussed how client indicates to peer that multiple
SMC-D devices are available and how server picks a suitable one.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/af_smc.c  | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
 net/smc/smc_clc.c |  4 +++-
 net/smc/smc_ism.c |  3 ++-
 3 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 9546c02..b9884c8 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -979,6 +979,28 @@ static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini)
 	return 0;
 }
 
+/* check if there is a lo device available for this connection. */
+static int smc_find_lo_device(struct smc_sock *smc, struct smc_init_info *ini)
+{
+	struct smcd_dev *sdev;
+
+	mutex_lock(&smcd_dev_list.mutex);
+	list_for_each_entry(sdev, &smcd_dev_list.list, list) {
+		if (sdev->is_loopback && !sdev->going_away &&
+		    (!ini->ism_peer_gid[0] ||
+		     !smc_ism_cantalk(ini->ism_peer_gid[0], ini->vlan_id,
+				      sdev))) {
+			ini->ism_dev[0] = sdev;
+			break;
+		}
+	}
+	mutex_unlock(&smcd_dev_list.mutex);
+	if (!ini->ism_dev[0])
+		return SMC_CLC_DECL_NOSMCDDEV;
+	ini->ism_chid[0] = smc_ism_get_chid(ini->ism_dev[0]);
+	return 0;
+}
+
 /* is chid unique for the ism devices that are already determined? */
 static bool smc_find_ism_v2_is_unique_chid(u16 chid, struct smc_init_info *ini,
 					   int cnt)
@@ -1044,10 +1066,20 @@ static int smc_find_proposal_devices(struct smc_sock *smc,
 {
 	int rc = 0;
 
-	/* check if there is an ism device available */
+	/* TODO:
+	 * How to indicate to peer if ism device and loopback
+	 * device are both available ?
+	 *
+	 * The RFC patch hasn't resolved this, just simply always
+	 * chooses loopback device first, and fallback if loopback
+	 * communication is impossible.
+	 *
+	 */
+	/* check if there is an ism or loopback device available */
 	if (!(ini->smcd_version & SMC_V1) ||
-	    smc_find_ism_device(smc, ini) ||
-	    smc_connect_ism_vlan_setup(smc, ini))
+	    (smc_find_lo_device(smc, ini) &&
+	    (smc_find_ism_device(smc, ini) ||
+	    smc_connect_ism_vlan_setup(smc, ini))))
 		ini->smcd_version &= ~SMC_V1;
 	/* else ISM V1 is supported for this connection */
 
@@ -2135,9 +2167,20 @@ static void smc_find_ism_v1_device_serv(struct smc_sock *new_smc,
 		goto not_found;
 	ini->is_smcd = true; /* prepare ISM check */
 	ini->ism_peer_gid[0] = ntohll(pclc_smcd->ism.gid);
-	rc = smc_find_ism_device(new_smc, ini);
-	if (rc)
-		goto not_found;
+
+	/* TODO:
+	 * How to know that peer has both loopback and ism device ?
+	 *
+	 * The RFC patch hasn't resolved this, simply tries loopback
+	 * device first, then ism device.
+	 */
+	/* find available loopback or ism device */
+	if (smc_find_lo_device(new_smc, ini)) {
+		rc = smc_find_ism_device(new_smc, ini);
+		if (rc)
+			goto not_found;
+	}
+
 	ini->ism_selected = 0;
 	rc = smc_listen_ism_init(new_smc, ini);
 	if (!rc)
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index dfb9797..3887692 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -486,7 +486,9 @@ static int smc_clc_prfx_set4_rcu(struct dst_entry *dst, __be32 ipv4,
 		return -ENODEV;
 
 	in_dev_for_each_ifa_rcu(ifa, in_dev) {
-		if (!inet_ifa_match(ipv4, ifa))
+		/* add loopback support */
+		if (inet_addr_type(dev_net(dst->dev), ipv4) != RTN_LOCAL &&
+		    !inet_ifa_match(ipv4, ifa))
 			continue;
 		prop->prefix_len = inet_mask_len(ifa->ifa_mask);
 		prop->outgoing_subnet = ifa->ifa_address & ifa->ifa_mask;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 911fe08..1d10435 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -227,7 +227,8 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
 	if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
 		goto errattr;
 	memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
-	smc_set_pci_values(to_pci_dev(smcd->dev.parent), &smc_pci_dev);
+	if (!smcd->is_loopback)
+		smc_set_pci_values(to_pci_dev(smcd->dev.parent), &smc_pci_dev);
 	if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
 		goto errattr;
 	if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH net-next 1/5] net/smc: introduce SMC-D loopback device
From: Wen Gu @ 2022-12-19 17:07 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel
In-Reply-To: <1671469668-82691-1-git-send-email-guwen@linux.alibaba.com>

This patch introduces a kind of loopback device for SMC-D, thus
enabling the SMC communication between two local sockets in one
kernel.

The loopback device supports basic capabilities defined by SMC-D,
including registering DMB, unregistering DMB and moving data.

Considering that there is no ism device on other servers expect
IBM z13, the loopback device can be used as a dummy device to
test SMC-D logic for the broad community.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 include/net/smc.h      |   1 +
 net/smc/Makefile       |   2 +-
 net/smc/af_smc.c       |  12 +-
 net/smc/smc_cdc.c      |   6 +
 net/smc/smc_cdc.h      |   1 +
 net/smc/smc_loopback.c | 289 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h |  59 ++++++++++
 7 files changed, 368 insertions(+), 2 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h

diff --git a/include/net/smc.h b/include/net/smc.h
index c926d33..7699f97 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -93,6 +93,7 @@ struct smcd_dev {
 	atomic_t lgr_cnt;
 	wait_queue_head_t lgrs_deleted;
 	u8 going_away : 1;
+	u8 is_loopback : 1;
 };
 
 struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 875efcd..a8c3711 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_SMC)	+= smc.o
 obj-$(CONFIG_SMC_DIAG)	+= smc_diag.o
 smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
 smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_loopback.o
 smc-$(CONFIG_SYSCTL) += smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index e12d4fa..9546c02 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -52,6 +52,7 @@
 #include "smc_stats.h"
 #include "smc_tracepoint.h"
 #include "smc_sysctl.h"
+#include "smc_loopback.h"
 
 static DEFINE_MUTEX(smc_server_lgr_pending);	/* serialize link group
 						 * creation on server
@@ -3451,15 +3452,23 @@ static int __init smc_init(void)
 		goto out_sock;
 	}
 
+	rc = smc_loopback_init();
+	if (rc) {
+		pr_err("%s: smc_loopback_init fails with %d\n", __func__, rc);
+		goto out_ib;
+	}
+
 	rc = tcp_register_ulp(&smc_ulp_ops);
 	if (rc) {
 		pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
-		goto out_ib;
+		goto out_lo;
 	}
 
 	static_branch_enable(&tcp_have_smc);
 	return 0;
 
+out_lo:
+	smc_loopback_exit();
 out_ib:
 	smc_ib_unregister_client();
 out_sock:
@@ -3494,6 +3503,7 @@ static void __exit smc_exit(void)
 	tcp_unregister_ulp(&smc_ulp_ops);
 	sock_unregister(PF_SMC);
 	smc_core_exit();
+	smc_loopback_exit();
 	smc_ib_unregister_client();
 	destroy_workqueue(smc_close_wq);
 	destroy_workqueue(smc_tcp_ls_wq);
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 53f63bf..61f5ff7 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -408,6 +408,12 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
 static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
 {
 	struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
+
+	smcd_cdc_rx_handler(conn);
+}
+
+void smcd_cdc_rx_handler(struct smc_connection *conn)
+{
 	struct smcd_cdc_msg *data_cdc;
 	struct smcd_cdc_msg cdc;
 	struct smc_sock *smc;
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 696cc11..11559d4 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
 				 struct smc_wr_buf *wr_buf);
 int smc_cdc_init(void) __init;
 void smcd_cdc_rx_init(struct smc_connection *conn);
+void smcd_cdc_rx_handler(struct smc_connection *conn);
 
 #endif /* SMC_CDC_H */
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
new file mode 100644
index 0000000..803156e
--- /dev/null
+++ b/net/smc/smc_loopback.c
@@ -0,0 +1,289 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  Shared Memory Communications Direct over loopback device.
+ *
+ *  Provide a SMC-D loopback dummy device.
+ *
+ *  Copyright (c) 2022, Alibaba Inc.
+ *
+ *  Author: Wen Gu <guwen@linux.alibaba.com>
+ *          Tony Lu <tonylu@linux.alibaba.com>
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/types.h>
+#include <net/smc.h>
+
+#include "smc_cdc.h"
+#include "smc_loopback.h"
+
+#define DRV_NAME "smc_lodev"
+
+struct lo_dev *lo_dev;
+
+static const struct pci_device_id lo_dev_table[] = {
+	{ PCI_VDEVICE(IBM, PCI_ANY_ID), 0 },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, lo_dev_table);
+
+static struct lo_systemeid LO_SYSTEM_EID = {
+	.seid_string = "SMC-SYSZ-LOSEID000000000",
+	.serial_number = "0000",
+	.type = "0000",
+};
+
+int lo_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid,
+		  u32 vid)
+{
+	struct lo_dev *ldev = smcd->priv;
+
+	/* return local gid */
+	if (!ldev || rgid != ldev->lgid)
+		return -ENETUNREACH;
+	return 0;
+}
+
+int lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+	struct lo_dev *ldev = smcd->priv;
+	struct lo_dmb_node *dmb_node;
+	int sba_idx, rc;
+
+	/* check space for new dmb */
+	for_each_clear_bit(sba_idx, ldev->sba_idx_mask, LODEV_MAX_DMBS) {
+		if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
+			break;
+	}
+	if (sba_idx == LODEV_MAX_DMBS)
+		return -ENOSPC;
+
+	dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
+	if (!dmb_node) {
+		rc = -ENOMEM;
+		goto err_bit;
+	}
+
+	dmb_node->sba_idx = sba_idx;
+	dmb_node->cpu_addr = kzalloc(dmb->dmb_len, GFP_KERNEL |
+			     __GFP_NOWARN | __GFP_NORETRY |
+			     __GFP_NOMEMALLOC);
+	if (!dmb_node->cpu_addr) {
+		rc = -ENOMEM;
+		goto err_node;
+	}
+	dmb_node->len = dmb->dmb_len;
+	dmb_node->dma_addr = (dma_addr_t)dmb_node->cpu_addr;
+
+	/* TODO: token is random but not exclusive !
+	 * suppose to find token in dmb hask table, if has this token
+	 * already, then generate another one.
+	 */
+	/* add new dmb into hash table */
+	get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
+	write_lock(&ldev->dmb_ht_lock);
+	hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
+	write_unlock(&ldev->dmb_ht_lock);
+
+	dmb->sba_idx = dmb_node->sba_idx;
+	dmb->dmb_tok = dmb_node->token;
+	dmb->cpu_addr = dmb_node->cpu_addr;
+	dmb->dma_addr = dmb_node->dma_addr;
+	dmb->dmb_len = dmb_node->len;
+
+	return 0;
+
+err_node:
+	kfree(dmb_node);
+err_bit:
+	clear_bit(sba_idx, ldev->sba_idx_mask);
+	return rc;
+}
+
+int lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+	struct lo_dmb_node *dmb_node = NULL, *tmp_node;
+	struct lo_dev *ldev = smcd->priv;
+
+	/* remove dmb from hash table */
+	write_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+		if (tmp_node->token == dmb->dmb_tok) {
+			dmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!dmb_node) {
+		write_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	hash_del(&dmb_node->list);
+	write_unlock(&ldev->dmb_ht_lock);
+
+	clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+	kfree(dmb_node->cpu_addr);
+	kfree(dmb_node);
+
+	return 0;
+}
+
+int lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+	return 0;
+}
+
+int lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+	return 0;
+}
+
+int lo_set_vlan_required(struct smcd_dev *smcd)
+{
+	return 0;
+}
+
+int lo_reset_vlan_required(struct smcd_dev *smcd)
+{
+	return 0;
+}
+
+int lo_signal_ieq(struct smcd_dev *smcd, u64 rgid, u32 trigger_irq,
+		  u32 event_code, u64 info)
+{
+	return 0;
+}
+
+int lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
+		 bool sf, unsigned int offset, void *data,
+		  unsigned int size)
+{
+	struct lo_dmb_node *rmb_node = NULL, *tmp_node;
+	struct lo_dev *ldev = smcd->priv;
+
+	read_lock(&ldev->dmb_ht_lock);
+	hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
+		if (tmp_node->token == dmb_tok) {
+			rmb_node = tmp_node;
+			break;
+		}
+	}
+	if (!rmb_node) {
+		read_unlock(&ldev->dmb_ht_lock);
+		return -EINVAL;
+	}
+	read_unlock(&ldev->dmb_ht_lock);
+
+	memcpy((char *)rmb_node->cpu_addr + offset, data, size);
+
+	if (sf) {
+		struct smc_connection *conn =
+			smcd->conn[rmb_node->sba_idx];
+
+		if (conn && !conn->killed)
+			smcd_cdc_rx_handler(conn);
+	}
+	return 0;
+}
+
+u8 *lo_get_system_eid(void)
+{
+	return &LO_SYSTEM_EID.seid_string[0];
+}
+
+u16 lo_get_chid(struct smcd_dev *smcd)
+{
+	return 0;
+}
+
+static const struct smcd_ops lo_ops = {
+	.query_remote_gid = lo_query_rgid,
+	.register_dmb = lo_register_dmb,
+	.unregister_dmb = lo_unregister_dmb,
+	.add_vlan_id = lo_add_vlan_id,
+	.del_vlan_id = lo_del_vlan_id,
+	.set_vlan_required = lo_set_vlan_required,
+	.reset_vlan_required = lo_reset_vlan_required,
+	.signal_event = lo_signal_ieq,
+	.move_data = lo_move_data,
+	.get_system_eid = lo_get_system_eid,
+	.get_chid = lo_get_chid,
+};
+
+static int lo_dev_init(struct lo_dev *ldev)
+{
+	struct smcd_dev *smcd = ldev->smcd;
+
+	/* smcd related */
+	smcd->is_loopback = 1;
+	smcd->priv = ldev;
+	get_random_bytes(&smcd->local_gid, sizeof(smcd->local_gid));
+
+	/* ldev related */
+	/* TODO: lgid is random but not exclusive !
+	 */
+	ldev->lgid = smcd->local_gid;
+	rwlock_init(&ldev->dmb_ht_lock);
+	hash_init(ldev->dmb_ht);
+
+	return smcd_register_dev(smcd);
+}
+
+static int lo_dev_probe(void)
+{
+	struct lo_dev *ldev;
+	int ret;
+
+	ldev = kzalloc(sizeof(*ldev), GFP_KERNEL);
+	if (!ldev)
+		return -ENOMEM;
+
+	ldev->smcd = smcd_alloc_dev(NULL, "smcd-loopback-dev",
+				    &lo_ops, LODEV_MAX_DMBS);
+	if (!ldev->smcd) {
+		ret = -ENOMEM;
+		goto err_ldev;
+	}
+
+	ret = lo_dev_init(ldev);
+	if (ret)
+		goto err_smcd;
+
+	lo_dev = ldev;
+	return 0;
+
+err_smcd:
+	smcd_free_dev(ldev->smcd);
+err_ldev:
+	kfree(ldev);
+	return ret;
+}
+
+static void lo_dev_exit(struct lo_dev *ldev)
+{
+	smcd_unregister_dev(ldev->smcd);
+}
+
+static void lo_dev_remove(void)
+{
+	if (!lo_dev)
+		return;
+
+	lo_dev_exit(lo_dev);
+	smcd_free_dev(lo_dev->smcd);
+	kfree(lo_dev);
+}
+
+int smc_loopback_init(void)
+{
+	/* TODO: now lo_dev is a global device shared by
+	 * the whole kernel, and can't be referred to by
+	 * smc-tools command 'smcd dev'. Need to be improved.
+	 */
+	return lo_dev_probe();
+}
+
+void smc_loopback_exit(void)
+{
+	lo_dev_remove();
+}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
new file mode 100644
index 0000000..d7f7815
--- /dev/null
+++ b/net/smc/smc_loopback.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *  Shared Memory Communications Direct over loopback device.
+ *
+ *  Provide a SMC-D loopback dummy device.
+ *
+ *  Copyright (c) 2022, Alibaba Inc.
+ *
+ *  Author: Wen Gu <guwen@linux.alibaba.com>
+ *          Tony Lu <tonylu@linux.alibaba.com>
+ *
+ */
+
+#ifndef _SMC_LOOPBACK_H
+#define _SMC_LOOPBACK_H
+
+#include <linux/types.h>
+#include <linux/interrupt.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <net/smc.h>
+
+#include "smc_core.h"
+
+#define LODEV_MAX_DMBS 5000
+#define LODEV_MAX_DMBS_BUCKETS 16
+
+struct lo_dmb_node {
+	struct hlist_node list;
+	u64 token;
+	u32 len;
+	u32 sba_idx;
+	void *cpu_addr;
+	dma_addr_t dma_addr;
+};
+
+struct lo_dev {
+	struct smcd_dev *smcd;
+	/* priv data */
+	u64 lgid;
+	DECLARE_BITMAP(sba_idx_mask, LODEV_MAX_DMBS);
+	rwlock_t dmb_ht_lock;
+	DECLARE_HASHTABLE(dmb_ht, LODEV_MAX_DMBS_BUCKETS);
+};
+
+struct lo_systemeid {
+	u8 seid_string[24];
+	u8 serial_number[4];
+	u8 type[4];
+};
+
+/* smcd loopback dev*/
+extern struct lo_dev *lo_dev;
+
+int smc_loopback_init(void);
+void smc_loopback_exit(void);
+
+#endif /* _SMC_LOOPBACK_H */
+
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH net-next 0/5] Introduce SMC-D based loopback acceleration
From: Wen Gu @ 2022-12-19 17:07 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, davem, edumazet, kuba, pabeni
  Cc: linux-s390, netdev, linux-kernel

Hi, all

# Background

As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
to accelerate TCP applications in cloud environment, improving inter-host
or inter-VM communication.

In addition of these, we also found the value of SMC-D in scenario of local
inter-process communication, such as accelerate communication between containers
within the same host. So this RFC tries to provide a SMC-D loopback solution
in such scenario, to bring a significant improvement in latency and throughput
compared to TCP loopback.

# Design

This patch set provides a kind of SMC-D loopback solution.

Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
inter-process communication acceleration. Except for loopback acceleration,
the dummy device can also meet the requirements mentioned in [2], which is
providing a way to test SMC-D logic for broad community without ISM device.

 +------------------------------------------+
 |  +-----------+           +-----------+   |
 |  | process A |           | process B |   |
 |  +-----------+           +-----------+   |
 |       ^                        ^         |
 |       |    +---------------+   |         |
 |       |    |   SMC stack   |   |         |
 |       +--->| +-----------+ |<--|         |
 |            | |   dummy   | |             |
 |            | |   device  | |             |
 |            +-+-----------+-+             |
 |                   VM                     |
 +------------------------------------------+

Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
and improve SMC-D loopback performance. Through extending smcd_ops with two
new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
physical memory region with receiver's RMB. The data copied from userspace
to sender's sndbuf directly reaches the receiver's RMB without unnecessary
memory copy in the same kernel.

 +----------+                     +----------+
 | socket A |                     | socket B |
 +----------+                     +----------+
       |                               ^
       |         +---------+           |
  regard as      |         | ----------|
  local sndbuf   |  B's    |     regard as
       |         |  RMB    |     local RMB
       |-------> |         |
                 +---------+

# Benchmark Test

 * Test environments:
      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
      - SMC sndbuf/RMB size 1MB.

 * Test object:
      - TCP: run on TCP loopback.
      - domain: run on UNIX domain.
      - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
      - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.

1. ipc-benchmark (see [3])

 - ./<foo> -c 1000000 -s 100

                       TCP              domain              SMC-lo             SMC-lo-nocpy
Message
rate (msg/s)         75140      129548(+72.41)    152266(+102.64%)         151914(+102.17%)

2. sockperf

 - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
 - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                       TCP                  SMC-lo             SMC-lo-nocpy
Bandwidth(MBps)   4943.359        4936.096(-0.15%)        8239.624(+66.68%)
Latency(us)          6.372          3.359(-47.28%)            3.25(-49.00%)

3. nginx/wrk

 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80

                       TCP                   SMC-lo             SMC-lo-nocpy
Requests/s       154643.22       220894.03(+42.84%)        226754.3(+46.63%)

# Discussion

1. API between SMC-D and ISM device

As Jan mentioned in [2], IBM are working on placing an API between SMC-D
and the ISM device for easier use of different "devices" for SMC-D.

So, considering that the introduction of attach_dmb or detach_dmb can
effectively avoid data copying from sndbuf to RMB and brings obvious
throughput advantages in inter-VM or inter-process scenarios, can the
attach/detach semantics be taken into consideration when designing the
API to make it a standard ISM device behavior?

Maybe our RFC of SMC-D based inter-process acceleration (this one) and
inter-VM acceleration (will coming soon, which is the update of [1])
can provide some examples for new API design. And we are very glad to
discuss this on the mail list.

2. Way to select different ISM-like devices

With the proposal of SMC-D loopback 'device' (this RFC) and incoming
device used for inter-VM acceleration as update of [1], SMC-D has more
options to choose from. So we need to consider that how to indicate
supported devices, how to determine which one to use, and their priority...

IMHO, this may require an update of CLC message and negotiation mechanism.
Again, we are very glad to discuss this with you on the mailing list.

[1] https://lore.kernel.org/netdev/20220720170048.20806-1-tonylu@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/35d14144-28f7-6129-d6d3-ba16dae7a646@linux.ibm.com/
[3] https://github.com/goldsborough/ipc-bench

Wen Gu (5):
  net/smc: introduce SMC-D loopback device
  net/smc: choose loopback device in SMC-D communication
  net/smc: add dmb attach and detach interface
  net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
  net/smc: logic of cursors update in SMC-D loopback connections

 include/net/smc.h      |   3 +
 net/smc/Makefile       |   2 +-
 net/smc/af_smc.c       |  88 +++++++++++-
 net/smc/smc_cdc.c      |  59 ++++++--
 net/smc/smc_cdc.h      |   1 +
 net/smc/smc_clc.c      |   4 +-
 net/smc/smc_core.c     |  62 +++++++++
 net/smc/smc_core.h     |   2 +
 net/smc/smc_ism.c      |  39 +++++-
 net/smc/smc_ism.h      |   2 +
 net/smc/smc_loopback.c | 365 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h |  63 +++++++++
 12 files changed, 669 insertions(+), 21 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h

-- 
1.8.3.1

^ permalink raw reply

* Re: iproute2 merge conflicts
From: Wilczynski, Michal @ 2022-12-19 17:00 UTC (permalink / raw)
  To: David Ahern; +Cc: Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <ffeb2330-9c1b-2ec6-e5a0-bfdd614b3fb1@gmail.com>

On 12/16/2022 5:14 PM, David Ahern wrote:
> Hi Michal:
>
> I merged main into next and hit a conflict with your recent patch set.
> Can you take a look at devlink/devlink.c, cmd_port_fn_rate_add and
> verify the conflict is correctly resolved?
>
> Thanks,
> David

Hi,
I've checked the file and everything looks good. The problem was that for
net-next tree I added two additional flags in a function call. And also submitted
a bugfix to net that added one more flag in the same function call.
BR,
Michał

^ permalink raw reply

* Re: [syzbot] kernel BUG in rxrpc_put_peer
From: David Howells @ 2022-12-19 16:44 UTC (permalink / raw)
  To: syzbot
  Cc: dhowells, davem, edumazet, kuba, linux-afs, linux-kernel,
	marc.dionne, netdev, pabeni, syzkaller-bugs
In-Reply-To: <0000000000002b4a9f05ef2b616f@google.com>

#syz test: https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/ 6529d70012e00166ab2ca4a92c4aa01e30a3037b


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox