Netdev List
* [PATCH net-next] net/ncsi: support unaligned payload size in NC-SI cmd handler
From: Ben Wei @ 2019-09-02  2:46 UTC
  To: Ben Wei, David Miller, sam@mendozajonas.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org
  Cc: Ben Wei

Update the NC-SI command handlers (both standard and OEM) to take
payload padding into account when allocating the skb, in case the
payload size is not 32-bit aligned.

The checksum field follows the payload field. If payload padding is
not taken into account, the checksum can be truncated, leading to
dropped packets.

Signed-off-by: Ben Wei <benwei@fb.com>
---
 net/ncsi/ncsi-cmd.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 0187e65176c0..42636ed3cf3a 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -213,17 +213,22 @@ static int ncsi_cmd_handler_oem(struct sk_buff *skb,
 {
 	struct ncsi_cmd_oem_pkt *cmd;
 	unsigned int len;
+	/* NC-SI spec requires payload to be padded with 0
+	 * to 32-bit boundary before the checksum field.
+	 * Ensure the padding bytes are accounted for in
+	 * skb allocation
+	 */
+	unsigned short payload = ALIGN(nca->payload, 4);
 
 	len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
-	if (nca->payload < 26)
+	if (payload < 26)
 		len += 26;
 	else
-		len += nca->payload;
+		len += payload;
 
 	cmd = skb_put_zero(skb, len);
 	memcpy(&cmd->mfr_id, nca->data, nca->payload);
 	ncsi_cmd_build_header(&cmd->cmd.common, nca);
-
 	return 0;
 }
 
@@ -272,6 +277,7 @@ static struct ncsi_request *ncsi_alloc_command(struct ncsi_cmd_arg *nca)
 	struct net_device *dev = nd->dev;
 	int hlen = LL_RESERVED_SPACE(dev);
 	int tlen = dev->needed_tailroom;
+	int payload;
 	int len = hlen + tlen;
 	struct sk_buff *skb;
 	struct ncsi_request *nr;
@@ -281,14 +287,17 @@ static struct ncsi_request *ncsi_alloc_command(struct ncsi_cmd_arg *nca)
 		return NULL;
 
 	/* NCSI command packet has 16-bytes header, payload, 4 bytes checksum.
+	 * Payload needs padding so that the checksum field following payload is
+	 * aligned to 32-bit boundary.
 	 * The packet needs padding if its payload is less than 26 bytes to
 	 * meet 64 bytes minimal ethernet frame length.
 	 */
 	len += sizeof(struct ncsi_cmd_pkt_hdr) + 4;
-	if (nca->payload < 26)
+	payload = ALIGN(nca->payload, 4);
+	if (payload < 26)
 		len += 26;
 	else
-		len += nca->payload;
+		len += payload;
 
 	/* Allocate skb */
 	skb = alloc_skb(len, GFP_ATOMIC);
-- 
2.17.1
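
The length rule in the patch can be checked in isolation. Below is a
minimal userspace sketch, assuming the 16-byte header and 4-byte checksum
stated in the code comment of ncsi_alloc_command(). ALIGN_UP() and
ncsi_cmd_len() are hypothetical stand-ins for the kernel's ALIGN() and the
inline length computation; they are not kernel API.

```c
#include <assert.h>

#define NCSI_HDR_LEN      16u  /* header size, per the comment in the patch */
#define NCSI_CHECKSUM_LEN  4u  /* checksum that follows the padded payload */
#define NCSI_MIN_PAYLOAD  26u  /* pads the frame to the 64-byte Ethernet minimum */

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1u)) & ~((a) - 1u))

/* Bytes needed for header + padded payload + checksum. */
static unsigned int ncsi_cmd_len(unsigned int payload)
{
	unsigned int padded = ALIGN_UP(payload, 4u);

	assert(padded >= payload);	/* guard against wrap-around */
	if (padded < NCSI_MIN_PAYLOAD)
		padded = NCSI_MIN_PAYLOAD;
	return NCSI_HDR_LEN + padded + NCSI_CHECKSUM_LEN;
}
```

With a 27-byte payload the unpatched computation reserves 16 + 27 + 4 = 47
bytes, one short of the 48 needed once the payload is padded to 28, so the
last checksum byte would not fit.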



* RE: [PATCH net-next] r8152: fix accessing skb after napi_gro_receive
From: Hayes Wang @ 2019-09-02  3:11 UTC
  To: Eric Dumazet, netdev@vger.kernel.org
  Cc: nic_swsd, linux-kernel@vger.kernel.org
In-Reply-To: <b39bc8a1-54c7-42d4-00ed-d48aa1bac734@gmail.com>

Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Friday, August 30, 2019 12:32 AM
> To: Hayes Wang; netdev@vger.kernel.org
> Cc: nic_swsd; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next] r8152: fix accessing skb after napi_gro_receive
> 
> On 8/19/19 5:15 AM, Hayes Wang wrote:
> > Fix accessing skb after napi_gro_receive which is caused by
> > commit 47922fcde536 ("r8152: support skb_add_rx_frag").
> >
> > Fixes: 47922fcde536 ("r8152: support skb_add_rx_frag")
> > Signed-off-by: Hayes Wang <hayeswang@realtek.com>
> > ---
> 
> It is customary to add a tag to credit the reporter...
> 
> Something like :
> 
> Reported-by: ....
> 
> Thanks.

Sorry, that was my mistake.
I will keep it in mind next time.

Best Regards,
Hayes




* Re: kernel panic: stack is corrupted in __lock_acquire (4)
From: Dmitry Vyukov @ 2019-09-02  3:23 UTC
  To: syzbot, bpf; +Cc: LKML, netdev, syzkaller-bugs
In-Reply-To: <0000000000000ec274059185a63e@google.com>

On Sun, Sep 1, 2019 at 3:48 PM syzbot
<syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com> wrote:
>
> syzbot has found a reproducer for the following crash on:
>
> HEAD commit:    38320f69 Merge branch 'Minor-cleanup-in-devlink'
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=13d74356600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=1bbf70b6300045af
> dashboard link: https://syzkaller.appspot.com/bug?extid=83979935eb6304f8cd46
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1008b232600000

Stack corruption + bpf maps in the repro rings some bells. +bpf mailing list.

> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
>
> Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
> __lock_acquire+0x36fa/0x4c30 kernel/locking/lockdep.c:3907
> CPU: 0 PID: 8662 Comm: syz-executor.4 Not tainted 5.3.0-rc6+ #153
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/0000000000000ec274059185a63e%40google.com.


* Re: kernel panic: stack is corrupted in lock_release (2)
From: Dmitry Vyukov @ 2019-09-02  3:24 UTC
  To: syzbot; +Cc: LKML, netdev, syzkaller-bugs, bpf
In-Reply-To: <00000000000088cdb2059186312f@google.com>

On Sun, Sep 1, 2019 at 4:27 PM syzbot
<syzbot+97deee97cf14574b96d0@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    dd7078f0 enetc: Add missing call to 'pci_free_irq_vectors(..
> git tree:       net
> console output: https://syzkaller.appspot.com/x/log.txt?x=115fe0fa600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
> dashboard link: https://syzkaller.appspot.com/bug?extid=97deee97cf14574b96d0
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11f7c2fe600000

Stack corruption + bpf maps in the repro rings some bells. +bpf mailing list.

> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+97deee97cf14574b96d0@syzkaller.appspotmail.com
>
> Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
> lock_release+0x866/0x960 kernel/locking/lockdep.c:4435
> CPU: 0 PID: 9965 Comm: syz-executor.0 Not tainted 5.3.0-rc6+ #182
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches


* Re: KASAN: use-after-free Write in __xfrm_policy_unlink (2)
From: Dmitry Vyukov @ 2019-09-02  3:31 UTC
  To: syzbot
  Cc: David Miller, Herbert Xu, LKML, netdev, Steffen Klassert,
	syzkaller-bugs
In-Reply-To: <000000000000cd5fdf0588fed11c@google.com>

On Thu, May 16, 2019 at 3:35 AM syzbot
<syzbot+0025447b4cb6f208558f@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    3b0f31f2 genetlink: make policy common to family
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=12a319df200000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f05902bca21d8935
> dashboard link: https://syzkaller.appspot.com/bug?extid=0025447b4cb6f208558f
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+0025447b4cb6f208558f@syzkaller.appspotmail.com

This looks like what has been fixed by:

#syz fix:
xfrm: policy: Fix out-of-bound array accesses in __xfrm_policy_unlink


> ==================================================================
> BUG: KASAN: use-after-free in __write_once_size
> include/linux/compiler.h:220 [inline]
> BUG: KASAN: use-after-free in __hlist_del include/linux/list.h:713 [inline]
> BUG: KASAN: use-after-free in hlist_del_rcu include/linux/rculist.h:455
> [inline]
> BUG: KASAN: use-after-free in __xfrm_policy_unlink+0x4b1/0x5c0
> net/xfrm/xfrm_policy.c:2212
> Write of size 8 at addr ffff8880a55a9e80 by task kworker/u4:6/7431
>
> CPU: 1 PID: 7431 Comm: kworker/u4:6 Not tainted 5.0.0+ #106
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: netns cleanup_net
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>   print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
>   kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
>   __asan_report_store8_noabort+0x17/0x20 mm/kasan/generic_report.c:137
>   __write_once_size include/linux/compiler.h:220 [inline]
>   __hlist_del include/linux/list.h:713 [inline]
>   hlist_del_rcu include/linux/rculist.h:455 [inline]
>   __xfrm_policy_unlink+0x4b1/0x5c0 net/xfrm/xfrm_policy.c:2212
>   xfrm_policy_flush+0x331/0x460 net/xfrm/xfrm_policy.c:1789
>   xfrm_policy_fini+0x49/0x3a0 net/xfrm/xfrm_policy.c:3871
>   xfrm_net_exit+0x1d/0x70 net/xfrm/xfrm_policy.c:3933
>   ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
>   cleanup_net+0x3fb/0x960 net/core/net_namespace.c:551
>   process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
>   worker_thread+0x98/0xe40 kernel/workqueue.c:2415
>   kthread+0x357/0x430 kernel/kthread.c:253
>   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
>
> Allocated by task 7242:
>   save_stack+0x45/0xd0 mm/kasan/common.c:75
>   set_track mm/kasan/common.c:87 [inline]
>   __kasan_kmalloc mm/kasan/common.c:497 [inline]
>   __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
>   kasan_kmalloc+0x9/0x10 mm/kasan/common.c:511
>   __do_kmalloc mm/slab.c:3726 [inline]
>   __kmalloc+0x15c/0x740 mm/slab.c:3735
>   kmalloc include/linux/slab.h:550 [inline]
>   kzalloc include/linux/slab.h:740 [inline]
>   ext4_htree_store_dirent+0x8a/0x650 fs/ext4/dir.c:450
>   htree_dirblock_to_tree+0x4fe/0x910 fs/ext4/namei.c:1021
>   ext4_htree_fill_tree+0x252/0xa50 fs/ext4/namei.c:1098
>   ext4_dx_readdir fs/ext4/dir.c:574 [inline]
>   ext4_readdir+0x1999/0x3490 fs/ext4/dir.c:121
>   iterate_dir+0x489/0x5f0 fs/readdir.c:51
>   __do_sys_getdents fs/readdir.c:231 [inline]
>   __se_sys_getdents fs/readdir.c:212 [inline]
>   __x64_sys_getdents+0x1dd/0x370 fs/readdir.c:212
>   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Freed by task 7242:
>   save_stack+0x45/0xd0 mm/kasan/common.c:75
>   set_track mm/kasan/common.c:87 [inline]
>   __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
>   kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
>   __cache_free mm/slab.c:3498 [inline]
>   kfree+0xcf/0x230 mm/slab.c:3821
>   free_rb_tree_fname+0x87/0xe0 fs/ext4/dir.c:402
>   ext4_htree_free_dir_info fs/ext4/dir.c:424 [inline]
>   ext4_release_dir+0x46/0x70 fs/ext4/dir.c:622
>   __fput+0x2e5/0x8d0 fs/file_table.c:278
>   ____fput+0x16/0x20 fs/file_table.c:309
>   task_work_run+0x14a/0x1c0 kernel/task_work.c:113
>   tracehook_notify_resume include/linux/tracehook.h:188 [inline]
>   exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:166
>   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>   do_syscall_64+0x52d/0x610 arch/x86/entry/common.c:293
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The buggy address belongs to the object at ffff8880a55a9e80
>   which belongs to the cache kmalloc-64 of size 64
> The buggy address is located 0 bytes inside of
>   64-byte region [ffff8880a55a9e80, ffff8880a55a9ec0)
> The buggy address belongs to the page:
> page:ffffea0002956a40 count:1 mapcount:0 mapping:ffff88812c3f0340 index:0x0
> flags: 0x1fffc0000000200(slab)
> raw: 01fffc0000000200 ffffea0002a0d748 ffffea00018af1c8 ffff88812c3f0340
> raw: 0000000000000000 ffff8880a55a9000 0000000100000020 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>   ffff8880a55a9d80: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
>   ffff8880a55a9e00: 00 00 00 00 04 fc fc fc fc fc fc fc fc fc fc fc
> > ffff8880a55a9e80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>                     ^
>   ffff8880a55a9f00: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
>   ffff8880a55a9f80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================


* Re: [RFC v3] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2019-09-02  4:15 UTC
  To: Tiwei Bie, mst, alex.williamson, maxime.coquelin
  Cc: linux-kernel, kvm, virtualization, netdev, dan.daly,
	cunming.liang, zhihong.wang, lingshan.zhu
In-Reply-To: <20190828053712.26106-1-tiwei.bie@intel.com>


On 2019/8/28 1:37 PM, Tiwei Bie wrote:
> Details about this can be found here:
>
> https://lwn.net/Articles/750770/
>
> What's new in this version
> ==========================
>
> There are three choices based on the discussion [1] in RFC v2:
>
>> #1. We expose a VFIO device, so we can reuse the VFIO container/group
>>      based DMA API and potentially reuse a lot of VFIO code in QEMU.
>>
>>      But in this case, we have two choices for the VFIO device interface
>>      (i.e. the interface on top of VFIO device fd):
>>
>>      A) we may invent a new vhost protocol (as demonstrated by the code
>>         in this RFC) on VFIO device fd to make it work in VFIO's way,
>>         i.e. regions and irqs.
>>
>>      B) Or as you proposed, instead of inventing a new vhost protocol,
>>         we can reuse most existing vhost ioctls on the VFIO device fd
>>         directly. There should be no conflicts between the VFIO ioctls
>>         (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
>>
>> #2. Instead of exposing a VFIO device, we may expose a VHOST device.
>>      And we will introduce a new mdev driver vhost-mdev to do this.
>>      It would be natural to reuse the existing kernel vhost interface
>>      (ioctls) on it as much as possible. But we will need to invent
>>      some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
>>      choice, but it's too heavy and doesn't support vIOMMU by itself).
> This version is more of a quick PoC to try Jason's proposal of
> reusing vhost ioctls. The second option (#1/B) of the three
> choices above was chosen in this version to demonstrate the idea quickly.
>
> Now the userspace API looks like this:
>
> - VFIO's container/group based IOMMU API is used to do the
>    DMA programming.
>
> - Vhost's existing ioctls are used to set up the device.
>
> And the device will report device_api as "vfio-vhost".
>
> Note that, there are dirty hacks in this version. If we decide to
> go this way, some refactoring in vhost.c/vhost.h may be needed.
>
> PS. The direct mapping of the notify registers isn't implemented
>      in this version.
>
> [1] https://lkml.org/lkml/2019/7/9/101
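
The claim that the two ioctl families cannot collide follows from how
Linux encodes ioctl numbers: the type byte is part of the command value.
A minimal userspace sketch, assuming only the type values quoted above
(0x3B and 0xAF); the DEMO_* names and helpers are hypothetical, and the
same command number 0x70 is used for both types on purpose.

```c
#include <assert.h>
#include <linux/ioctl.h>

/* Type bytes quoted in the discussion. */
#define DEMO_VFIO_TYPE   0x3B
#define DEMO_VHOST_TYPE  0xAF

static_assert(DEMO_VFIO_TYPE != DEMO_VHOST_TYPE, "type bytes must differ");

/* Same command number, different type byte: the encoded values differ. */
#define DEMO_VFIO_CMD   _IO(DEMO_VFIO_TYPE, 0x70)
#define DEMO_VHOST_CMD  _IO(DEMO_VHOST_TYPE, 0x70)

/* Returns nonzero if cmd carries the vhost type byte. */
static int demo_is_vhost_cmd(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == DEMO_VHOST_TYPE;
}

/* Returns nonzero if cmd carries the VFIO type byte. */
static int demo_is_vfio_cmd(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == DEMO_VFIO_TYPE;
}
```

A single ioctl entry point on the device fd can therefore dispatch on the
type byte before looking at the command number.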


Thanks for the patch, see comments inline.


>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
>   drivers/vhost/Kconfig      |   9 +
>   drivers/vhost/Makefile     |   3 +
>   drivers/vhost/mdev.c       | 382 +++++++++++++++++++++++++++++++++++++
>   include/linux/vhost_mdev.h |  58 ++++++
>   include/uapi/linux/vfio.h  |   2 +
>   include/uapi/linux/vhost.h |   8 +
>   6 files changed, 462 insertions(+)
>   create mode 100644 drivers/vhost/mdev.c
>   create mode 100644 include/linux/vhost_mdev.h
>
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index 3d03ccbd1adc..2ba54fcf43b7 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -34,6 +34,15 @@ config VHOST_VSOCK
>   	To compile this driver as a module, choose M here: the module will be called
>   	vhost_vsock.
>   
> +config VHOST_MDEV
> +	tristate "Hardware vhost accelerator abstraction"
> +	depends on EVENTFD && VFIO && VFIO_MDEV
> +	select VHOST
> +	default n
> +	---help---
> +	Say Y here to enable the vhost_mdev module
> +	for use with hardware vhost accelerators
> +
>   config VHOST
>   	tristate
>   	---help---
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> index 6c6df24f770c..ad9c0f8c6d8c 100644
> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
>   
>   obj-$(CONFIG_VHOST_RING) += vringh.o
>   
> +obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
> +vhost_mdev-y := mdev.o
> +
>   obj-$(CONFIG_VHOST)	+= vhost.o
> diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
> new file mode 100644
> index 000000000000..6bef1d9ae2e6
> --- /dev/null
> +++ b/drivers/vhost/mdev.c
> @@ -0,0 +1,382 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018-2019 Intel Corporation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/vfio.h>
> +#include <linux/vhost.h>
> +#include <linux/mdev.h>
> +#include <linux/vhost_mdev.h>
> +
> +#include "vhost.h"
> +
> +struct vhost_mdev {
> +	struct vhost_dev dev;
> +	bool opened;
> +	int nvqs;
> +	u64 state;
> +	u64 acked_features;
> +	u64 features;
> +	const struct vhost_mdev_device_ops *ops;
> +	struct mdev_device *mdev;
> +	void *private;
> +	struct vhost_virtqueue vqs[];
> +};
> +
> +static void handle_vq_kick(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> +						  poll.work);
> +	struct vhost_mdev *vdpa = container_of(vq->dev, struct vhost_mdev, dev);
> +
> +	vdpa->ops->notify(vdpa, vq - vdpa->vqs);
> +}
> +
> +static int vhost_set_state(struct vhost_mdev *vdpa, u64 __user *statep)
> +{
> +	u64 state;
> +
> +	if (copy_from_user(&state, statep, sizeof(state)))
> +		return -EFAULT;
> +
> +	if (state >= VHOST_MDEV_S_MAX)
> +		return -EINVAL;
> +
> +	if (vdpa->state == state)
> +		return 0;
> +
> +	mutex_lock(&vdpa->dev.mutex);
> +
> +	vdpa->state = state;
> +
> +	switch (vdpa->state) {
> +	case VHOST_MDEV_S_RUNNING:
> +		vdpa->ops->start(vdpa);
> +		break;
> +	case VHOST_MDEV_S_STOPPED:
> +		vdpa->ops->stop(vdpa);
> +		break;
> +	}
> +
> +	mutex_unlock(&vdpa->dev.mutex);
> +
> +	return 0;
> +}
> +
> +static int vhost_set_features(struct vhost_mdev *vdpa, u64 __user *featurep)
> +{
> +	u64 features;
> +
> +	if (copy_from_user(&features, featurep, sizeof(features)))
> +		return -EFAULT;
> +
> +	if (features & ~vdpa->features)
> +		return -EINVAL;
> +
> +	vdpa->acked_features = features;
> +	vdpa->ops->features_changed(vdpa);
> +	return 0;
> +}
> +
> +static int vhost_get_features(struct vhost_mdev *vdpa, u64 __user *featurep)
> +{
> +	if (copy_to_user(featurep, &vdpa->features, sizeof(vdpa->features)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static int vhost_get_vring_base(struct vhost_mdev *vdpa, void __user *argp)
> +{
> +	struct vhost_virtqueue *vq;
> +	u32 idx;
> +	int r;
> +
> +	r = get_user(idx, (u32 __user *)argp);
> +	if (r < 0)
> +		return r;
> +
> +	vq = &vdpa->vqs[idx];
> +	vq->last_avail_idx = vdpa->ops->get_vring_base(vdpa, idx);
> +
> +	return vhost_vring_ioctl(&vdpa->dev, VHOST_GET_VRING_BASE, argp);
> +}
> +
> +/*
> + * Helpers for backend to register mdev.
> + */
> +
> +struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev, void *private,
> +				    int nvqs)
> +{
> +	struct vhost_mdev *vdpa;
> +	struct vhost_dev *dev;
> +	struct vhost_virtqueue **vqs;
> +	size_t size;
> +	int i;
> +
> +	size = sizeof(struct vhost_mdev) + nvqs * sizeof(struct vhost_virtqueue);
> +
> +	vdpa = kzalloc(size, GFP_KERNEL);
> +	if (!vdpa)
> +		return NULL;
> +
> +	vdpa->nvqs = nvqs;
> +
> +	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs) {
> +		kfree(vdpa);
> +		return NULL;
> +	}
> +
> +	dev = &vdpa->dev;
> +	for (i = 0; i < nvqs; i++) {
> +		vqs[i] = &vdpa->vqs[i];
> +		vqs[i]->handle_kick = handle_vq_kick;
> +	}
> +	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
> +
> +	vdpa->private = private;
> +	vdpa->mdev = mdev;
> +
> +	mdev_set_drvdata(mdev, vdpa);
> +
> +	return vdpa;
> +}
> +EXPORT_SYMBOL(vhost_mdev_alloc);
> +
> +void vhost_mdev_free(struct vhost_mdev *vdpa)
> +{
> +	struct mdev_device *mdev;
> +
> +	mdev = vdpa->mdev;
> +	mdev_set_drvdata(mdev, NULL);
> +
> +	vhost_dev_stop(&vdpa->dev);
> +	vhost_dev_cleanup(&vdpa->dev);
> +	kfree(vdpa->dev.vqs);
> +	kfree(vdpa);
> +}
> +EXPORT_SYMBOL(vhost_mdev_free);
> +
> +ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
> +		  size_t count, loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(vhost_mdev_read);
> +
> +
> +ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
> +		   size_t count, loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(vhost_mdev_write);
> +
> +int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
> +{
> +	// TODO
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(vhost_mdev_mmap);
> +
> +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
> +		      unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct vhost_mdev *vdpa;
> +	unsigned long minsz;
> +	int ret = 0;
> +
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	vdpa = mdev_get_drvdata(mdev);
> +	if (!vdpa)
> +		return -ENODEV;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		if (info.argsz < minsz) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		info.flags = VFIO_DEVICE_FLAGS_VHOST;
> +		info.num_regions = 0;
> +		info.num_irqs = 0;
> +
> +		if (copy_to_user((void __user *)arg, &info, minsz)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		break;
> +	}
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +	case VFIO_DEVICE_GET_IRQ_INFO:
> +	case VFIO_DEVICE_SET_IRQS:
> +	case VFIO_DEVICE_RESET:
> +		ret = -EINVAL;
> +		break;
> +
> +	case VHOST_MDEV_SET_STATE:
> +		ret = vhost_set_state(vdpa, argp);
> +		break;


So this is used to start or stop the device. This means that if userspace
wants to drive a network device, the API is not 100% compatible. Is there
any blocker for this? E.g. for SET_BACKEND, we can pass an fd and then
identify the type of backend.

Another question: how can the user know the type of a device?


> +	case VHOST_GET_FEATURES:
> +		ret = vhost_get_features(vdpa, argp);
> +		break;
> +	case VHOST_SET_FEATURES:
> +		ret = vhost_set_features(vdpa, argp);
> +		break;
> +	case VHOST_GET_VRING_BASE:
> +		ret = vhost_get_vring_base(vdpa, argp);
> +		break;
> +	default:
> +		ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp);
> +		if (ret == -ENOIOCTLCMD)
> +			ret = vhost_vring_ioctl(&vdpa->dev, cmd, argp);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(vhost_mdev_ioctl);
> +
> +int vhost_mdev_open(struct mdev_device *mdev)
> +{
> +	struct vhost_mdev *vdpa;
> +	int ret = 0;
> +
> +	vdpa = mdev_get_drvdata(mdev);
> +	if (!vdpa)
> +		return -ENODEV;
> +
> +	mutex_lock(&vdpa->dev.mutex);
> +
> +	if (vdpa->opened)
> +		ret = -EBUSY;
> +	else
> +		vdpa->opened = true;
> +
> +	mutex_unlock(&vdpa->dev.mutex);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(vhost_mdev_open);
> +
> +void vhost_mdev_close(struct mdev_device *mdev)
> +{
> +	struct vhost_mdev *vdpa;
> +
> +	vdpa = mdev_get_drvdata(mdev);
> +
> +	mutex_lock(&vdpa->dev.mutex);
> +
> +	vhost_dev_stop(&vdpa->dev);
> +	vhost_dev_cleanup(&vdpa->dev);
> +
> +	vdpa->opened = false;
> +	mutex_unlock(&vdpa->dev.mutex);
> +}
> +EXPORT_SYMBOL(vhost_mdev_close);
> +
> +/*
> + * Helpers for backend to set/get information.
> + */
> +
> +int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
> +			      const struct vhost_mdev_device_ops *ops)
> +{
> +	vdpa->ops = ops;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_set_device_ops);
> +
> +int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features)
> +{
> +	vdpa->features = features;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_set_features);
> +
> +struct eventfd_ctx *
> +vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa, int queue_id)
> +{
> +	return vdpa->vqs[queue_id].call_ctx;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_call_ctx);
> +
> +int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features)
> +{
> +	*features = vdpa->acked_features;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_acked_features);
> +
> +int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num)
> +{
> +	*num = vdpa->vqs[queue_id].num;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_vring_num);
> +
> +int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base)
> +{
> +	*base = vdpa->vqs[queue_id].last_avail_idx;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_vring_base);
> +
> +int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
> +			      struct vhost_vring_addr *addr)
> +{
> +	struct vhost_virtqueue *vq = &vdpa->vqs[queue_id];
> +
> +	/*
> +	 * XXX: we need userspace to pass guest physical address or
> +	 *      IOVA directly.
> +	 */
> +	addr->flags = vq->log_used ? (0x1 << VHOST_VRING_F_LOG) : 0;
> +	addr->desc_user_addr = (__u64)vq->desc;
> +	addr->avail_user_addr = (__u64)vq->avail;
> +	addr->used_user_addr = (__u64)vq->used;
> +	addr->log_guest_addr = (__u64)vq->log_addr;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_vring_addr);
> +
> +int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
> +			    void **log_base, u64 *log_size)
> +{
> +	// TODO
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_log_base);
> +
> +struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa)
> +{
> +	return vdpa->mdev;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_mdev);
> +
> +void *vhost_mdev_get_private(struct vhost_mdev *vdpa)
> +{
> +	return vdpa->private;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_private);
> +
> +MODULE_VERSION("0.0.0");
> +MODULE_LICENSE("GPL v2");
> +MODULE_DESCRIPTION("Hardware vhost accelerator abstraction");
> diff --git a/include/linux/vhost_mdev.h b/include/linux/vhost_mdev.h
> new file mode 100644
> index 000000000000..070787ce6b36
> --- /dev/null
> +++ b/include/linux/vhost_mdev.h
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018-2019 Intel Corporation.
> + */
> +
> +#ifndef _VHOST_MDEV_H
> +#define _VHOST_MDEV_H
> +
> +struct mdev_device;
> +struct vhost_mdev;
> +
> +typedef int (*vhost_mdev_start_device_t)(struct vhost_mdev *vdpa);
> +typedef int (*vhost_mdev_stop_device_t)(struct vhost_mdev *vdpa);
> +typedef int (*vhost_mdev_set_features_t)(struct vhost_mdev *vdpa);
> +typedef void (*vhost_mdev_notify_device_t)(struct vhost_mdev *vdpa, int queue_id);
> +typedef u64 (*vhost_mdev_get_notify_addr_t)(struct vhost_mdev *vdpa, int queue_id);
> +typedef u16 (*vhost_mdev_get_vring_base_t)(struct vhost_mdev *vdpa, int queue_id);
> +typedef void (*vhost_mdev_features_changed_t)(struct vhost_mdev *vdpa);
> +
> +struct vhost_mdev_device_ops {
> +	vhost_mdev_start_device_t	start;
> +	vhost_mdev_stop_device_t	stop;
> +	vhost_mdev_notify_device_t	notify;
> +	vhost_mdev_get_notify_addr_t	get_notify_addr;
> +	vhost_mdev_get_vring_base_t	get_vring_base;
> +	vhost_mdev_features_changed_t	features_changed;
> +};


Suppose we want to implement a network device: who is going to
implement the device configuration space? I don't think it's good to
invent another set of APIs for this, so I believe we want something
like read_config/write_config here.

Then I came up with an idea:

1) introduce a new mdev bus transport, and a new mdev driver virtio_mdev
2) vDPA (either software or hardware) can register as a device of the
virtio mdev device
3) then we can use the kernel virtio driver to drive the vDPA device and
utilize the kernel networking/storage stack
4) a userspace driver like vhost-mdev could be built on top of the
mdev transport

With a full new transport for virtio, the advantages are obvious:

1) A generic solution for both kernel and userspace drivers, with support
for configuration space access
2) For the kernel driver, the existing kernel networking/storage stack can
be reused, and so can fast-path implementations (e.g. XDP, io_uring etc).
3) For the userspace driver, the function of the virtio transport is a
superset of vhost, so any API can be built on top easily (e.g. vhost ioctl).

What do you think?

Thanks
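
To make the read_config/write_config suggestion concrete, here is a
minimal userspace sketch of what such callbacks might look like. Every
name here (demo_config_ops, demo_dev, the 8-byte config size) is a
hypothetical illustration and not an existing kernel or vhost interface;
a flat byte array stands in for the device configuration space.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CONFIG_SIZE 8u	/* arbitrary size for the toy config space */

/* Hypothetical config-space callbacks a backend would provide. */
struct demo_config_ops {
	int (*read_config)(void *dev, size_t offset, void *buf, size_t len);
	int (*write_config)(void *dev, size_t offset, const void *buf, size_t len);
};

/* Toy software device. */
struct demo_dev {
	uint8_t config[DEMO_CONFIG_SIZE];
};

/* Reject accesses outside the config space without overflowing offset + len. */
static int demo_check_range(size_t offset, size_t len)
{
	if (offset > DEMO_CONFIG_SIZE || len > DEMO_CONFIG_SIZE - offset)
		return -EINVAL;
	return 0;
}

static int demo_read_config(void *dev, size_t offset, void *buf, size_t len)
{
	struct demo_dev *d = dev;

	assert(d && buf);
	if (demo_check_range(offset, len))
		return -EINVAL;
	memcpy(buf, d->config + offset, len);
	return 0;
}

static int demo_write_config(void *dev, size_t offset, const void *buf, size_t len)
{
	struct demo_dev *d = dev;

	assert(d && buf);
	if (demo_check_range(offset, len))
		return -EINVAL;
	memcpy(d->config + offset, buf, len);
	return 0;
}

static const struct demo_config_ops demo_ops = {
	.read_config  = demo_read_config,
	.write_config = demo_write_config,
};
```

Both a kernel driver and a userspace-facing layer could sit on top of the
same pair of callbacks, which is the point of putting them in the
transport rather than in a vhost-specific API.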


> +
> +struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev,
> +		void *private, int nvqs);
> +void vhost_mdev_free(struct vhost_mdev *vdpa);
> +
> +ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
> +		size_t count, loff_t *ppos);
> +ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
> +		size_t count, loff_t *ppos);
> +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
> +		unsigned long arg);
> +int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
> +int vhost_mdev_open(struct mdev_device *mdev);
> +void vhost_mdev_close(struct mdev_device *mdev);
> +
> +int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
> +		const struct vhost_mdev_device_ops *ops);
> +int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features);
> +struct eventfd_ctx *vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa,
> +		int queue_id);
> +int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features);
> +int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num);
> +int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base);
> +int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
> +		struct vhost_vring_addr *addr);
> +int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
> +		void **log_base, u64 *log_size);
> +struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa);
> +void *vhost_mdev_get_private(struct vhost_mdev *vdpa);
> +
> +#endif /* _VHOST_MDEV_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8f10748dac79..0300d6831cc5 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -201,6 +201,7 @@ struct vfio_device_info {
>   #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
>   #define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
>   #define VFIO_DEVICE_FLAGS_AP	(1 << 5)	/* vfio-ap device */
> +#define VFIO_DEVICE_FLAGS_VHOST	(1 << 6)	/* vfio-vhost device */
>   	__u32	num_regions;	/* Max region index + 1 */
>   	__u32	num_irqs;	/* Max IRQ index + 1 */
>   };
> @@ -217,6 +218,7 @@ struct vfio_device_info {
>   #define VFIO_DEVICE_API_AMBA_STRING		"vfio-amba"
>   #define VFIO_DEVICE_API_CCW_STRING		"vfio-ccw"
>   #define VFIO_DEVICE_API_AP_STRING		"vfio-ap"
> +#define VFIO_DEVICE_API_VHOST_STRING		"vfio-vhost"
>   
>   /**
>    * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index 40d028eed645..5afbc2f08fa3 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -116,4 +116,12 @@
>   #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
>   #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
>   
> +/* VHOST_MDEV specific defines */
> +
> +#define VHOST_MDEV_SET_STATE	_IOW(VHOST_VIRTIO, 0x70, __u64)
> +
> +#define VHOST_MDEV_S_STOPPED	0
> +#define VHOST_MDEV_S_RUNNING	1
> +#define VHOST_MDEV_S_MAX	2
> +
>   #endif

^ permalink raw reply

* [PATCH v3 3/5] mdev: Expose mdev alias in sysfs tree
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Expose the optional alias for an mdev device as a sysfs attribute.
This way, userspace tools such as udev may make use of the alias, for
example to create a netdevice name for the mdev.

Updated documentation for optional read only sysfs attribute.
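
As a rough illustration of how a userspace tool might consume this
attribute, the sketch below reads an alias file and reports a missing
alias as an error code. The helper name read_mdev_alias() and the choice
to pass a full path are assumptions made for this example; only the
attribute name and the -EOPNOTSUPP behaviour come from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this series): read the alias from the
 * sysfs attribute at @path, e.g. /sys/bus/mdev/devices/<UUID>/alias.
 * Returns 0 and fills @buf on success, or a negative errno value. With
 * this patch, a device without an alias is expected to fail the read
 * with EOPNOTSUPP.
 */
static int read_mdev_alias(const char *path, char *buf, size_t len)
{
	FILE *f;
	int err;

	assert(path && buf && len > 0);

	f = fopen(path, "r");
	if (!f)
		return -errno;

	errno = 0;
	if (!fgets(buf, (int)len, f)) {
		/* Save errno before fclose() can overwrite it. */
		err = errno ? -errno : -ENODATA;
		fclose(f);
		return err;
	}
	fclose(f);

	/* Strip the trailing newline that alias_show() appends. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A caller such as a udev helper could treat -EOPNOTSUPP as "no alias
requested by the parent" and fall back to another naming scheme.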

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v2->v3:
 - Merged sysfs documentation patch with sysfs addition
 - Added more description for alias return value
v0->v1:
 - Addressed comments from Cornelia Huck
 - Updated commit description
---
 Documentation/driver-api/vfio-mediated-device.rst |  9 +++++++++
 drivers/vfio/mdev/mdev_sysfs.c                    | 13 +++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/Documentation/driver-api/vfio-mediated-device.rst b/Documentation/driver-api/vfio-mediated-device.rst
index 25eb7d5b834b..0b7d2bf843b6 100644
--- a/Documentation/driver-api/vfio-mediated-device.rst
+++ b/Documentation/driver-api/vfio-mediated-device.rst
@@ -270,6 +270,7 @@ Directories and Files Under the sysfs for Each mdev Device
          |--- remove
          |--- mdev_type {link to its type}
          |--- vendor-specific-attributes [optional]
+         |--- alias
 
 * remove (write only)
 
@@ -281,6 +282,14 @@ Example::
 
 	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
 
+* alias (read only, optional)
+Whenever a parent requests alias generation, each mdev device of that
+parent is assigned a unique alias by the mdev core.
+This file shows the alias of the mdev device.
+
+Reading the file returns the alias when one is assigned, or fails with
+-EOPNOTSUPP when aliases are unsupported.
+
 Mediated device Hot plug
 ------------------------
 
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 43afe0e80b76..59f4e3cc5233 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
 
 static DEVICE_ATTR_WO(remove);
 
+static ssize_t alias_show(struct device *device,
+			  struct device_attribute *attr, char *buf)
+{
+	struct mdev_device *dev = mdev_from_dev(device);
+
+	if (!dev->alias)
+		return -EOPNOTSUPP;
+
+	return sprintf(buf, "%s\n", dev->alias);
+}
+static DEVICE_ATTR_RO(alias);
+
 static const struct attribute *mdev_device_attrs[] = {
+	&dev_attr_alias.attr,
 	&dev_attr_remove.attr,
 	NULL,
 };
-- 
2.19.2


^ permalink raw reply related

* [PATCH v3 2/5] mdev: Make mdev alias unique among all mdevs
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Mdev alias should be unique among all the mdevs, so that when such alias
is used by the mdev users to derive other objects, there is no
collision in a given system.
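
The duplicate check can be pictured with the userspace sketch below. It
is only a model: struct fake_mdev and alias_in_use() are hypothetical
names, and a plain array stands in for the mdev list that the kernel
walks under mdev_list_lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct mdev_device; only the alias matters here. */
struct fake_mdev {
	const char *alias;	/* NULL when the parent requested no alias */
};

/*
 * Return true if @alias is already used by any of the @n devices in @devs.
 * Devices without an alias never collide, and a NULL @alias is never a
 * duplicate.
 */
static bool alias_in_use(const struct fake_mdev *devs, size_t n,
			 const char *alias)
{
	size_t i;

	assert(devs || n == 0);

	if (!alias)
		return false;

	for (i = 0; i < n; i++) {
		if (devs[i].alias && !strcmp(alias, devs[i].alias))
			return true;
	}
	return false;
}
```

When such a check reports a duplicate, the patch fails the create
request with -EEXIST instead of trying to pick a different alias.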

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v2->v3:
 - Changed strcmp() ==0 to !strcmp()
v1->v2:
 - Moved alias NULL check at beginning
v0->v1:
 - Fixed inclusion of alias for NULL check
 - Added ratelimited debug print for sha1 hash collision error
---
 drivers/vfio/mdev/mdev_core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 3bdff0469607..c8cd40366783 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -388,6 +388,13 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
 			ret = -EEXIST;
 			goto mdev_fail;
 		}
+		if (alias && tmp->alias && !strcmp(alias, tmp->alias)) {
+			mutex_unlock(&mdev_list_lock);
+			ret = -EEXIST;
+			dev_dbg_ratelimited(dev, "Hash collision in alias creation for UUID %pUl\n",
+					    uuid);
+			goto mdev_fail;
+		}
 	}
 
 	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
-- 
2.19.2


^ permalink raw reply related

* [PATCH v3 4/5] mdev: Introduce an API mdev_alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Introduce an API mdev_alias() to provide access to optionally generated
alias.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 12 ++++++++++++
 include/linux/mdev.h          |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index c8cd40366783..9eec556fbdd4 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -517,6 +517,18 @@ struct device *mdev_get_iommu_device(struct device *dev)
 }
 EXPORT_SYMBOL(mdev_get_iommu_device);
 
+/**
+ * mdev_alias: Return alias string of a mdev device
+ * @mdev:	Pointer to the mdev device
+ * mdev_alias() returns alias string of a mdev device if alias is present,
+ * returns NULL otherwise.
+ */
+const char *mdev_alias(struct mdev_device *mdev)
+{
+	return mdev->alias;
+}
+EXPORT_SYMBOL(mdev_alias);
+
 static int __init mdev_init(void)
 {
 	int ret;
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index f036fe9854ee..6da82213bc4e 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -148,5 +148,6 @@ void mdev_unregister_driver(struct mdev_driver *drv);
 struct device *mdev_parent_dev(struct mdev_device *mdev);
 struct device *mdev_dev(struct mdev_device *mdev);
 struct mdev_device *mdev_from_dev(struct device *dev);
+const char *mdev_alias(struct mdev_device *mdev);
 
 #endif /* MDEV_H */
-- 
2.19.2


^ permalink raw reply related

* [PATCH v3 5/5] mtty: Optionally support mtty alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Provide a module parameter to set the alias length so that an mdev alias
is optionally generated.

Example to request an mdev alias:
$ modprobe mtty alias_length=12

Make use of the mdev_alias() API when the alias_length module parameter is set.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
Changelog:
v1->v2:
 - Added mdev_alias() usage sample
---
 samples/vfio-mdev/mtty.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
index 92e770a06ea2..075d65440bc0 100644
--- a/samples/vfio-mdev/mtty.c
+++ b/samples/vfio-mdev/mtty.c
@@ -150,6 +150,10 @@ static const struct file_operations vd_fops = {
 	.owner          = THIS_MODULE,
 };
 
+static unsigned int mtty_alias_length;
+module_param_named(alias_length, mtty_alias_length, uint, 0444);
+MODULE_PARM_DESC(alias_length, "mdev alias length; default=0");
+
 /* function prototypes */
 
 static int mtty_trigger_interrupt(const guid_t *uuid);
@@ -770,6 +774,9 @@ static int mtty_create(struct kobject *kobj, struct mdev_device *mdev)
 	list_add(&mdev_state->next, &mdev_devices_list);
 	mutex_unlock(&mdev_list_lock);
 
+	if (mtty_alias_length)
+		dev_dbg(mdev_dev(mdev), "alias is %s\n", mdev_alias(mdev));
+
 	return 0;
 }
 
@@ -1410,6 +1417,11 @@ static struct attribute_group *mdev_type_groups[] = {
 	NULL,
 };
 
+static unsigned int mtty_get_alias_length(void)
+{
+	return mtty_alias_length;
+}
+
 static const struct mdev_parent_ops mdev_fops = {
 	.owner                  = THIS_MODULE,
 	.dev_attr_groups        = mtty_dev_groups,
@@ -1422,6 +1434,7 @@ static const struct mdev_parent_ops mdev_fops = {
 	.read                   = mtty_read,
 	.write                  = mtty_write,
 	.ioctl		        = mtty_ioctl,
+	.get_alias_length	= mtty_get_alias_length
 };
 
 static void mtty_device_release(struct device *dev)
-- 
2.19.2


^ permalink raw reply related

* [PATCH v3 1/5] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Some vendor drivers want an identifier for an mdev device that is
shorter than the UUID, due to length restrictions in the consumers of
that identifier.

Add a callback that allows a vendor driver to request an alias of a
specified length to be generated for an mdev device. If generated,
that alias is checked for collisions.

It is an optional attribute.
mdev alias is generated using sha1 from the mdev name.
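
The length handling can be sketched in userspace as follows. This models
only the truncation step (round the requested length up to an even
number, hex-encode that many digest bytes, then cut the extra character
for odd lengths); computing the SHA-1 digest itself is left out. The
name digest_to_alias() and the DIGEST_SIZE macro are assumptions for
this example, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DIGEST_SIZE 20	/* SHA-1 digest size in bytes */

/*
 * Write at most @alias_len hex characters of @digest into @out.
 * @out must have room for the rounded-up even length plus a NUL byte.
 */
static void digest_to_alias(const unsigned char digest[DIGEST_SIZE],
			    unsigned int alias_len, char *out)
{
	unsigned int even_len = (alias_len + 1) & ~1u;	/* round up to even */
	unsigned int nbytes = even_len / 2;
	unsigned int i;

	assert(digest && out);

	/* Never read past the end of the digest. */
	if (nbytes > DIGEST_SIZE)
		nbytes = DIGEST_SIZE;

	for (i = 0; i < nbytes; i++)
		sprintf(out + 2 * i, "%02x", digest[i]);
	out[2 * nbytes] = '\0';

	/* For an odd length, drop the extra hex character just written. */
	if (alias_len < 2 * nbytes)
		out[alias_len] = '\0';
}
```

Each digest byte yields two hex characters, which is why an odd alias
length needs one byte more than it finally keeps.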

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v1->v2:
 - Kept mdev_device naturally aligned
 - Added error checking for crypto_*() calls
 - Corrected a typo from 'and' to 'an'
 - Changed return type of generate_alias() from int to char*
v0->v1:
 - Moved alias length check outside of the parent lock
 - Moved alias and digest allocation from kvzalloc to kzalloc
 - &alias[0] changed to alias
 - alias_length check is nested under get_alias_length callback check
 - Changed comments to start with an empty line
 - Fixed cleanup of hash if mdev_bus_register() fails
 - Added comment where alias memory ownership is handed over to mdev device
 - Updated commit log to indicate motivation for this feature
---
 drivers/vfio/mdev/mdev_core.c    | 123 ++++++++++++++++++++++++++++++-
 drivers/vfio/mdev/mdev_private.h |   5 +-
 drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
 include/linux/mdev.h             |   4 +
 4 files changed, 135 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b558d4cfd082..3bdff0469607 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -10,9 +10,11 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/uuid.h>
 #include <linux/sysfs.h>
 #include <linux/mdev.h>
+#include <crypto/hash.h>
 
 #include "mdev_private.h"
 
@@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
 static LIST_HEAD(mdev_list);
 static DEFINE_MUTEX(mdev_list_lock);
 
+static struct crypto_shash *alias_hash;
+
 struct device *mdev_parent_dev(struct mdev_device *mdev)
 {
 	return mdev->parent->dev;
@@ -150,6 +154,16 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
 		return -EINVAL;
 
+	if (ops->get_alias_length) {
+		unsigned int digest_size;
+		unsigned int aligned_len;
+
+		aligned_len = roundup(ops->get_alias_length(), 2);
+		digest_size = crypto_shash_digestsize(alias_hash);
+		if (aligned_len / 2 > digest_size)
+			return -EINVAL;
+	}
+
 	dev = get_device(dev);
 	if (!dev)
 		return -EINVAL;
@@ -259,6 +273,7 @@ static void mdev_device_free(struct mdev_device *mdev)
 	mutex_unlock(&mdev_list_lock);
 
 	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev->alias);
 	kfree(mdev);
 }
 
@@ -269,18 +284,101 @@ static void mdev_device_release(struct device *dev)
 	mdev_device_free(mdev);
 }
 
-int mdev_device_create(struct kobject *kobj,
-		       struct device *dev, const guid_t *uuid)
+static const char *
+generate_alias(const char *uuid, unsigned int max_alias_len)
+{
+	struct shash_desc *hash_desc;
+	unsigned int digest_size;
+	unsigned char *digest;
+	unsigned int alias_len;
+	char *alias;
+	int ret;
+
+	/*
+	 * Align to multiple of 2 as bin2hex will generate
+	 * even number of bytes.
+	 */
+	alias_len = roundup(max_alias_len, 2);
+	alias = kzalloc(alias_len + 1, GFP_KERNEL);
+	if (!alias)
+		return ERR_PTR(-ENOMEM);
+
+	/* Allocate and init descriptor */
+	hash_desc = kvzalloc(sizeof(*hash_desc) +
+			     crypto_shash_descsize(alias_hash),
+			     GFP_KERNEL);
+	if (!hash_desc) {
+		ret = -ENOMEM;
+		goto desc_err;
+	}
+
+	hash_desc->tfm = alias_hash;
+
+	digest_size = crypto_shash_digestsize(alias_hash);
+
+	digest = kzalloc(digest_size, GFP_KERNEL);
+	if (!digest) {
+		ret = -ENOMEM;
+		goto digest_err;
+	}
+	ret = crypto_shash_init(hash_desc);
+	if (ret)
+		goto hash_err;
+
+	ret = crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
+	if (ret)
+		goto hash_err;
+
+	ret = crypto_shash_final(hash_desc, digest);
+	if (ret)
+		goto hash_err;
+
+	bin2hex(alias, digest, min_t(unsigned int, digest_size, alias_len / 2));
+	/*
+	 * When alias length is odd, zero out an additional last byte
+	 * that bin2hex has copied.
+	 */
+	if (max_alias_len % 2)
+		alias[max_alias_len] = 0;
+
+	kfree(digest);
+	kvfree(hash_desc);
+	return alias;
+
+hash_err:
+	kfree(digest);
+digest_err:
+	kvfree(hash_desc);
+desc_err:
+	kfree(alias);
+	return ERR_PTR(ret);
+}
+
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+		       const char *uuid_str, const guid_t *uuid)
 {
 	int ret;
 	struct mdev_device *mdev, *tmp;
 	struct mdev_parent *parent;
 	struct mdev_type *type = to_mdev_type(kobj);
+	const char *alias = NULL;
 
 	parent = mdev_get_parent(type->parent);
 	if (!parent)
 		return -EINVAL;
 
+	if (parent->ops->get_alias_length) {
+		unsigned int alias_len;
+
+		alias_len = parent->ops->get_alias_length();
+		if (alias_len) {
+			alias = generate_alias(uuid_str, alias_len);
+			if (IS_ERR(alias)) {
+				ret = PTR_ERR(alias);
+				goto alias_fail;
+			}
+		}
+	}
 	mutex_lock(&mdev_list_lock);
 
 	/* Check for duplicate */
@@ -300,6 +398,12 @@ int mdev_device_create(struct kobject *kobj,
 	}
 
 	guid_copy(&mdev->uuid, uuid);
+	mdev->alias = alias;
+	/*
+	 * At this point alias memory is owned by the mdev.
+	 * Mark it NULL, so that only mdev can free it.
+	 */
+	alias = NULL;
 	list_add(&mdev->next, &mdev_list);
 	mutex_unlock(&mdev_list_lock);
 
@@ -346,6 +450,8 @@ int mdev_device_create(struct kobject *kobj,
 	up_read(&parent->unreg_sem);
 	put_device(&mdev->dev);
 mdev_fail:
+	kfree(alias);
+alias_fail:
 	mdev_put_parent(parent);
 	return ret;
 }
@@ -406,7 +512,17 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
 
 static int __init mdev_init(void)
 {
-	return mdev_bus_register();
+	int ret;
+
+	alias_hash = crypto_alloc_shash("sha1", 0, 0);
+	if (IS_ERR(alias_hash))
+		return PTR_ERR(alias_hash);
+
+	ret = mdev_bus_register();
+	if (ret)
+		crypto_free_shash(alias_hash);
+
+	return ret;
 }
 
 static void __exit mdev_exit(void)
@@ -415,6 +531,7 @@ static void __exit mdev_exit(void)
 		class_compat_unregister(mdev_bus_compat_class);
 
 	mdev_bus_unregister();
+	crypto_free_shash(alias_hash);
 }
 
 module_init(mdev_init)
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 7d922950caaf..078fdaf7836e 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -32,6 +32,7 @@ struct mdev_device {
 	struct list_head next;
 	struct kobject *type_kobj;
 	struct device *iommu_device;
+	const char *alias;
 	bool active;
 };
 
@@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
 int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
 void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
 
-int  mdev_device_create(struct kobject *kobj,
-			struct device *dev, const guid_t *uuid);
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+		       const char *uuid_str, const guid_t *uuid);
 int  mdev_device_remove(struct device *dev);
 
 #endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 7570c7602ab4..43afe0e80b76 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
 		return -ENOMEM;
 
 	ret = guid_parse(str, &uuid);
-	kfree(str);
 	if (ret)
-		return ret;
+		goto err;
 
-	ret = mdev_device_create(kobj, dev, &uuid);
+	ret = mdev_device_create(kobj, dev, str, &uuid);
 	if (ret)
-		return ret;
+		goto err;
 
-	return count;
+	ret = count;
+
+err:
+	kfree(str);
+	return ret;
 }
 
 MDEV_TYPE_ATTR_WO(create);
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 0ce30ca78db0..f036fe9854ee 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
  * @mmap:		mmap callback
  *			@mdev: mediated device structure
  *			@vma: vma structure
+ * @get_alias_length:	Generate alias for the mdevs of this parent based on the
+ *			mdev device name when it returns non zero alias length.
+ *			It is optional.
  * Parent device that support mediated device should be registered with mdev
  * module with mdev_parent_ops structure.
  **/
@@ -92,6 +95,7 @@ struct mdev_parent_ops {
 	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
 			 unsigned long arg);
 	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+	unsigned int (*get_alias_length)(void);
 };
 
 /* interface for exporting mdev supported type attributes */
-- 
2.19.2


^ permalink raw reply related

* [PATCH v3 0/5] Introduce variable length mdev alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190826204119.54386-1-parav@mellanox.com>

To have consistent naming for the netdevice of a mdev and to have
consistent naming of the devlink port [1] of a mdev, which is formed using
phys_port_name of the devlink port, current UUID is not usable because
UUID is too long.

UUID in string format is 36 characters long and in binary 128 bits.
Neither format fits within the 15-character limit of a netdev
name.

It is desirable to have mdev device naming consistently derived from the
UUID, so that widely used user space frameworks such as ovs [2] can make use
of mdev representors in a similar way to PCIe SR-IOV VF and PF representors.

Hence,
(a) An mdev alias is created, derived using sha1 from the mdev name.
(b) The vendor driver describes how long an alias should be for the child
mdevs created for a given parent.
(c) Mdev aliases are unique at system level.
(d) The alias is optional and is created only when the parent requests it.
This ensures that non-networking mdev parents can function without alias
creation overhead.

This design is discussed at [3].

An example systemd/udev extension will have,

1. netdev name created using mdev alias available in sysfs.

mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
mdev 12 character alias=cd5b146a80a5

netdev name of this mdev = enmcd5b146a80a5
Here en = Ethernet link
m = mediated device

2. devlink port phys_port_name created using mdev alias.
devlink phys_port_name=pcd5b146a80a5
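
The naming scheme in the two examples above can be sketched as below.
The helper build_name() and the NETDEV_NAME_MAX macro are hypothetical;
the 15-character limit is the netdev name limit mentioned earlier, and
applying the same limit to phys_port_name is an assumption of this
sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NETDEV_NAME_MAX 15	/* IFNAMSIZ (16) minus the terminating NUL */

/*
 * Hypothetical helper: build "<prefix><alias>" into @out (size @len).
 * Returns 0 on success, or -1 if the result would exceed the name limit
 * or the buffer.
 */
static int build_name(const char *prefix, const char *alias,
		      char *out, size_t len)
{
	size_t need;

	assert(prefix && alias && out);

	need = strlen(prefix) + strlen(alias);
	if (need > NETDEV_NAME_MAX || need + 1 > len)
		return -1;

	snprintf(out, len, "%s%s", prefix, alias);
	return 0;
}
```

With the "enm" prefix, a 12 character alias gives a name of exactly 15
characters, while the 36 character UUID string is rejected.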

This patchset enables mdev core to maintain unique alias for a mdev.

Patch-1 Introduces mdev alias using sha1.
Patch-2 Ensures that mdev alias is unique in a system.
Patch-3 Exposes mdev alias in the sysfs hierarchy and updates Documentation.
Patch-4 Introduces mdev_alias() API.
Patch-5 Extends mtty driver to optionally provide alias generation.
This also makes it possible to test UUID based sha1 collisions and to
trigger the error handling for duplicate sha1 results.

[1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
[2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
[3] https://patchwork.kernel.org/cover/11084231/

---
Changelog:
v2->v3:
 - Addressed comment from Yunsheng Lin
 - Changed strcmp() ==0 to !strcmp()
 - Addressed comment from Cornelia Huck
 - Merged sysfs Documentation patch with sysfs patch
 - Added more description for alias return value
v1->v2:
 - Corrected a typo from 'and' to 'an'
 - Addressed comments from Alex Williamson
 - Kept mdev_device naturally aligned
 - Added error checking for crypto_*() calls
 - Moved alias NULL check at beginning
 - Added mdev_alias() API
 - Updated mtty driver to show example mdev_alias() usage
 - Changed return type of generate_alias() from int to char*
v0->v1:
 - Addressed comments from Alex Williamson, Cornelia Huck and Mark Bloch
 - Moved alias length check outside of the parent lock
 - Moved alias and digest allocation from kvzalloc to kzalloc
 - &alias[0] changed to alias
 - alias_length check is nested under get_alias_length callback check
 - Changed comments to start with an empty line
 - Added comment where alias memory ownership is handed over to mdev device
 - Fixed cleanup of hash if mdev_bus_register() fails
 - Updated documentation for new sysfs alias file
 - Improved commit logs to make description more clear
 - Fixed inclusion of alias for NULL check
 - Added ratelimited debug print for sha1 hash collision error

Parav Pandit (5):
  mdev: Introduce sha1 based mdev alias
  mdev: Make mdev alias unique among all mdevs
  mdev: Expose mdev alias in sysfs tree
  mdev: Introduce an API mdev_alias
  mtty: Optionally support mtty alias

 .../driver-api/vfio-mediated-device.rst       |   9 ++
 drivers/vfio/mdev/mdev_core.c                 | 142 +++++++++++++++++-
 drivers/vfio/mdev/mdev_private.h              |   5 +-
 drivers/vfio/mdev/mdev_sysfs.c                |  26 +++-
 include/linux/mdev.h                          |   5 +
 samples/vfio-mdev/mtty.c                      |  13 ++
 6 files changed, 190 insertions(+), 10 deletions(-)

-- 
2.19.2


^ permalink raw reply

* Re: [bpf-next, v2] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Daniel T. Lee @ 2019-09-02  4:43 UTC (permalink / raw)
  To: Song Liu; +Cc: Daniel Borkmann, Alexei Starovoitov, Networking
In-Reply-To: <CAEKGpzhGkLGswP3G9BzY1YErVOuNQRRBD2y=4g7u7dfh1by3aA@mail.gmail.com>

On Fri, Aug 30, 2019 at 3:23 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> On Fri, Aug 30, 2019 at 5:42 AM Song Liu <liu.song.a23@gmail.com> wrote:
> >
> > On Mon, Aug 26, 2019 at 9:52 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
> > >
> > > Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
> > > to 600. To make this size flexible, a new map 'pcktsz' is added.
> > >
> > > By updating new packet size to this map from the userland,
> > > xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.
> > >
> > > If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
> > > will be 600 as a default.
> >
> > Please also cc bpf@vger.kernel.org for bpf patches.
> >
>
> I'll make sure to have it included next time.
>
> > >
> > > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> >
> > Acked-by: Song Liu <songliubraving@fb.com>
> >
> > With a nit below.
> >
> > [...]
> >
> > > diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
> > > index a3596b617c4c..29ade7caf841 100644
> > > --- a/samples/bpf/xdp_adjust_tail_user.c
> > > +++ b/samples/bpf/xdp_adjust_tail_user.c
> > > @@ -72,6 +72,7 @@ static void usage(const char *cmd)
> > >         printf("Usage: %s [...]\n", cmd);
> > >         printf("    -i <ifname|ifindex> Interface\n");
> > >         printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> > > +       printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
> >
> > nit: printf("    -P <MAX_PCKT_SIZE> Default: %u\n", MAX_PCKT_SIZE);
>
> With all due respect, I'm afraid that MAX_PCKT_SIZE constant is only
> defined at '_kern.c'.
> Are you saying that it should also be defined in '_user.c'?
>
> Thanks for the review!

Ping?

^ permalink raw reply

* Re: [PATCH v3] tun: fix use-after-free when register netdev failed
From: Jason Wang @ 2019-09-02  5:32 UTC (permalink / raw)
  To: Yang Yingliang
  Cc: David Miller, netdev, eric dumazet, xiyou wangcong, weiyongjun1
In-Reply-To: <5D5FB3B6.5080800@huawei.com>


On 2019/8/23 5:36 PM, Yang Yingliang wrote:
>
>
> On 2019/8/23 11:05, Jason Wang wrote:
>> ----- Original Message -----
>>>
>>> On 2019/8/22 14:07, Yang Yingliang wrote:
>>>>
>>>> On 2019/8/22 10:13, Jason Wang wrote:
>>>>> On 2019/8/20 10:28 AM, Jason Wang wrote:
>>>>>> On 2019/8/20 9:25 AM, David Miller wrote:
>>>>>>> From: Yang Yingliang <yangyingliang@huawei.com>
>>>>>>> Date: Mon, 19 Aug 2019 21:31:19 +0800
>>>>>>>
>>>>>>>> Call tun_attach() after register_netdevice() to make sure tfile->tun
>>>>>>>> is not published until the netdevice is registered. So the read/write
>>>>>>>> thread can not use the tun pointer that may be freed by free_netdev().
>>>>>>>> (The tun and dev pointer are allocated by alloc_netdev_mqs(), they can
>>>>>>>> be freed by netdev_freemem().)
>>>>>>> register_netdevice() must always be the last operation in the order of
>>>>>>> network device setup.
>>>>>>>
>>>>>>> At the point register_netdevice() is called, the device is visible
>>>>>>> globally and therefore all of its software state must be fully
>>>>>>> initialized and ready for use.
>>>>>>> You're going to have to find another solution to these problems.
>>>>>>
>>>>>> The device is loosely coupled with sockets/queues. Each side is
>>>>>> allowed to go away without caring about the other side. So in this
>>>>>> case, there's a small window where the network stack thinks the device
>>>>>> has one queue but it actually has none; the code can then safely drop
>>>>>> them. Maybe it's ok here with some comments?
>>>>>>
>>>>>> Or if not, we can try to hold the device before tun_attach and drop
>>>>>> it after register_netdevice().
>>>>>
>>>>> Hi Yang:
>>>>>
>>>>> I think maybe we can try to hold refcnt instead of playing real num
>>>>> queues here. Do you want to post a V4?
>>>> I think the refcnt can prevent freeing the memory in this case.
>>>> When register_netdevice() failed, free_netdev() will be called directly,
>>>> dev->pcpu_refcnt and dev are freed without checking refcnt of dev.
>>> How about using patch-v1, which uses a flag to check whether the device
>>> registered successfully?
>>>
>> As I said, it lacks sufficient locks or barriers. To be clear, I meant
>> something like (compile-test only):
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index db16d7a13e00..e52678f9f049 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2828,6 +2828,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>                                (ifr->ifr_flags & TUN_FEATURES);
>>                    INIT_LIST_HEAD(&tun->disabled);
>> +               dev_hold(dev);
>>                  err = tun_attach(tun, file, false, ifr->ifr_flags & IFF_NAPI,
>>                                   ifr->ifr_flags & IFF_NAPI_FRAGS);
>>                  if (err < 0)
>> @@ -2836,6 +2837,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>                  err = register_netdevice(tun->dev);
>>                  if (err < 0)
>>                          goto err_detach;
>> +               dev_put(dev);
>>          }
>>            netif_carrier_on(tun->dev);
>> @@ -2852,11 +2854,13 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>          return 0;
>>     err_detach:
>> +       dev_put(dev);
>>          tun_detach_all(dev);
>>          /* register_netdevice() already called tun_free_netdev() */
>>          goto err_free_dev;
>>     err_free_flow:
>> +       dev_put(dev);
>>          tun_flow_uninit(tun);
>>          security_tun_dev_free_security(tun->security);
>>   err_free_stat:
>>
>> What's your thought?
>
> The dev pointer is freed without checking the refcount in free_netdev()
> called by the err_free_dev path, so I don't understand how the refcount
> protects this pointer.
>

The refcount is guaranteed to be zero there, isn't it?

Thanks


> Thanks,
> Yang
>
>>
>> Thanks
>>
>> .
>>
>
>

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Heiner Kallweit @ 2019-09-02  6:07 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <fafc1c05-d7ac-f108-74f9-207617773968@gmail.com>

On 02.09.2019 04:07, Florian Fainelli wrote:
> 
> 
> On 8/8/2019 1:24 PM, Heiner Kallweit wrote:
>> On 08.08.2019 22:20, Andrew Lunn wrote:
>>>> I have a contact in Realtek who provided the information about
>>>> the vendor-specific registers used in the patch. I also asked for
>>>> a method to auto-detect 2.5Gbps support but have no feedback so far.
>>>> What may contribute to the problem is that also the integrated 1Gbps
>>>> PHY's (all with the same PHY ID) differ significantly from each other,
>>>> depending on the network chip version.
>>>
>>> Hi Heiner
>>>
>>> Some of the PHYs embedded in Marvell switches have an OUI, but no
>>> product ID. We work around this brokenness by trapping the reads to
>>> the ID registers in the MDIO bus controller driver and inserting the
>>> switch product ID. The Marvell PHY driver then recognises these IDs
>>> and does the right thing.
>>>
>>> Maybe you can do something similar here?
>>>
>> Yes, this would be an idea. Let me check.
> 
> Since this is an integrated PHY you could have the MAC driver pass a
> specific phydev->dev_flag bit that indicates that this is RTL8125, since
> I am assuming that PCI IDs for those different chipsets do have to be
> allocated, right?
> 
Hi Florian,

thanks for the feedback. In the meantime Realtek provided a method to
identify NBase-T-capable PHYs, and the respective match_phy_device
callback implementations were added in
5181b473d64e ("net: phy: realtek: add NBase-T PHY auto-detection").

Heiner

^ permalink raw reply

* [PATCH bpf-next] arm64: bpf: optimize modulo operation
From: jerinj @ 2019-09-02  6:14 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, Zi Shen Lim,
	Catalin Marinas, Will Deacon, Martin KaFai Lau, Song Liu,
	Yonghong Song, open list:BPF JIT for ARM64,
	moderated list:ARM64 PORT (AARCH64 ARCHITECTURE), open list
  Cc: Jerin Jacob

From: Jerin Jacob <jerinj@marvell.com>

Optimize modulo operation instruction generation by
using single MSUB instruction vs MUL followed by SUB
instruction scheme.
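
For reference, the arithmetic behind the change can be modelled in plain
C as below. The function name mod_via_msub() is only illustrative, and
the divide-by-zero case is deliberately left out of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Models UDIV followed by MSUB, where MSUB computes Rd = Ra - Rn * Rm. */
static uint64_t mod_via_msub(uint64_t dst, uint64_t src)
{
	uint64_t tmp;

	assert(src != 0);	/* division by zero is out of scope here */

	tmp = dst / src;	/* UDIV tmp, dst, src */
	return dst - tmp * src;	/* MSUB dst, tmp, src, dst */
}
```

The multiply and the subtract of the old sequence are folded into the
single multiply-subtract, so the result is unchanged while one
instruction is saved per modulo operation.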

Signed-off-by: Jerin Jacob <jerinj@marvell.com>
---
 arch/arm64/net/bpf_jit.h      | 3 +++
 arch/arm64/net/bpf_jit_comp.c | 6 ++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h
index cb7ab50b7657..eb73f9f72c46 100644
--- a/arch/arm64/net/bpf_jit.h
+++ b/arch/arm64/net/bpf_jit.h
@@ -171,6 +171,9 @@
 /* Rd = Ra + Rn * Rm */
 #define A64_MADD(sf, Rd, Ra, Rn, Rm) aarch64_insn_gen_data3(Rd, Ra, Rn, Rm, \
 	A64_VARIANT(sf), AARCH64_INSN_DATA3_MADD)
+/* Rd = Ra - Rn * Rm */
+#define A64_MSUB(sf, Rd, Ra, Rn, Rm) aarch64_insn_gen_data3(Rd, Ra, Rn, Rm, \
+	A64_VARIANT(sf), AARCH64_INSN_DATA3_MSUB)
 /* Rd = Rn * Rm */
 #define A64_MUL(sf, Rd, Rn, Rm) A64_MADD(sf, Rd, A64_ZR, Rn, Rm)
 
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f5b437f8a22b..cdc79de0c794 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -409,8 +409,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			break;
 		case BPF_MOD:
 			emit(A64_UDIV(is64, tmp, dst, src), ctx);
-			emit(A64_MUL(is64, tmp, tmp, src), ctx);
-			emit(A64_SUB(is64, dst, dst, tmp), ctx);
+			emit(A64_MSUB(is64, dst, dst, tmp, src), ctx);
 			break;
 		}
 		break;
@@ -516,8 +515,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	case BPF_ALU64 | BPF_MOD | BPF_K:
 		emit_a64_mov_i(is64, tmp2, imm, ctx);
 		emit(A64_UDIV(is64, tmp, dst, tmp2), ctx);
-		emit(A64_MUL(is64, tmp, tmp, tmp2), ctx);
-		emit(A64_SUB(is64, dst, dst, tmp), ctx);
+		emit(A64_MSUB(is64, dst, dst, tmp, tmp2), ctx);
 		break;
 	case BPF_ALU | BPF_LSH | BPF_K:
 	case BPF_ALU64 | BPF_LSH | BPF_K:
-- 
2.23.0



* RE: Proposal: r8152 firmware patching framework
From: Hayes Wang @ 2019-09-02  6:31 UTC (permalink / raw)
  To: Amber Chen, Prashant Malani
  Cc: David Miller, netdev@vger.kernel.org, Bambi Yeh, Ryankao, Jackc,
	Albertk, marcochen@google.com, nic_swsd, Grant Grundler
In-Reply-To: <755AFD2B-D66F-40FF-ADCD-5077ECC569FE@realtek.com>

Prashant Malani <pmalani@chromium.org> 
> >
> > (Adding a few more Realtek folks)
> >
> > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?
> >
> >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani
> <pmalani@chromium.org> wrote:
> >>
> >> Hi,
> >>
> >> The r8152 driver source code distributed by Realtek (on
> >> www.realtek.com) contains firmware patches. This involves binary
> >> byte-arrays being written byte/word-wise to the hardware memory
> >> Example: grundler@chromium.org (cc-ed) has an experimental patch
> which
> >> includes the firmware patching code which was distributed with the
> >> Realtek source :
> >>
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel
> /+/1417953
> >>
> >> It would be nice to have a way to incorporate these firmware fixes
> >> into the upstream code. Since having indecipherable byte-arrays is not
> >> possible upstream, I propose the following:
> >> - We use the assistance of Realtek to come up with a format which the
> >> firmware patch files can follow (this can be documented in the
> >> comments).
> >>       - A real simple format could look like this:
> >>               +
> >>
> <section1><size_in_bytes><address1><data1><address2><data2>...<addressN
> ><dataN><section2>...
> >>                + The driver would be able to understand how to parse
> >> each section (e.g is each data entry a byte or a word?)
> >>
> >> - We use request_firmware() to load the firmware, parse it and write
> >> the data to the relevant registers.
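
To make the proposed layout concrete, here is a rough parser sketch. Nothing
in it comes from Realtek or from the r8152 driver: the field widths
(little-endian 16-bit section id, payload size, address and data) and the
names fw_write and fw_parse are assumptions chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, every field little-endian 16-bit:
 *   section id | payload size in bytes | (address, data) pairs ...
 */
struct fw_write {
	uint16_t section;
	uint16_t addr;
	uint16_t data;
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode blob into out[]; returns the number of writes, or -1 if malformed. */
static int fw_parse(const uint8_t *blob, size_t len,
		    struct fw_write *out, size_t max_out)
{
	size_t off = 0, n = 0;

	while (off < len) {
		uint16_t section, size;
		size_t end;

		if (len - off < 4)
			return -1;	/* truncated section header */
		section = get_le16(blob + off);
		size = get_le16(blob + off + 2);
		off += 4;

		if (size % 4 || size > len - off)
			return -1;	/* payload must be whole pairs and fit */
		end = off + size;

		for (; off < end; off += 4) {
			if (n == max_out)
				return -1;	/* caller's buffer is too small */
			out[n].section = section;
			out[n].addr = get_le16(blob + off);
			out[n].data = get_le16(blob + off + 2);
			n++;
		}
	}
	return (int)n;
}
```

In a driver, the blob would come from request_firmware() and each decoded
entry would turn into a register write. Validating every size before use
matters because the file contents are not trusted.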

I plan to finish the patches which I am going to submit, first. Then,
I could focus on this. However, I don't think I would start this
quickly. There are many preparations and they would take me a lot of
time.

Best Regards,
Hayes




* Re: [PATCH mlx5-next 0/5] Mellanox, mlx5 next updates 2019-09-29
From: Saeed Mahameed @ 2019-09-02  6:45 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org
In-Reply-To: <20190829234151.9958-1-saeedm@mellanox.com>

On Thu, 2019-08-29 at 23:42 +0000, Saeed Mahameed wrote:
> Hi All,
> 
> This series includes misc updates for mlx5-next shared branch
> required
> for upcoming software steering feature.
> 
> 1) Alex adds HW bits and definitions required for SW steering
> 2) Ariel moves device memory management to mlx5_core (From mlx5_ib)
> 3) Maor, Cleanups and fixups for eswitch mode and RoCE
> 4) Mar, Set only stag for match untagged packets
> 
> In case of no objection this series will be applied to mlx5-next
> branch
> and sent later as pull request to both rdma-next and net-next
> branches.
> 

Applied to mlx5-next
Thanks,
Saeed


* Re: [PATCH V2 4/4] crypto: Add Xilinx AES driver
From: Corentin Labbe @ 2019-09-02  6:58 UTC (permalink / raw)
  To: Kalyani Akula
  Cc: herbert, kstewart, gregkh, tglx, pombredanne, linux-crypto,
	linux-kernel, netdev, Kalyani Akula
In-Reply-To: <1567346098-27927-5-git-send-email-kalyani.akula@xilinx.com>

On Sun, Sep 01, 2019 at 07:24:58PM +0530, Kalyani Akula wrote:
> This patch adds AES driver support for the Xilinx
> ZynqMP SoC.
> 
> Signed-off-by: Kalyani Akula <kalyani.akula@xilinx.com>
> ---

Hello

I have some comments below.

>  drivers/crypto/Kconfig          |  11 ++
>  drivers/crypto/Makefile         |   1 +
>  drivers/crypto/zynqmp-aes-gcm.c | 297 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 309 insertions(+)
>  create mode 100644 drivers/crypto/zynqmp-aes-gcm.c
> 
> diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig
> index 603413f..a0d058a 100644
> --- a/drivers/crypto/Kconfig
> +++ b/drivers/crypto/Kconfig
> @@ -677,6 +677,17 @@ config CRYPTO_DEV_ROCKCHIP
>  	  This driver interfaces with the hardware crypto accelerator.
>  	  Supporting cbc/ecb chainmode, and aes/des/des3_ede cipher mode.
>  
> +config CRYPTO_DEV_ZYNQMP_AES
> +	tristate "Support for Xilinx ZynqMP AES hw accelerator"
> +	depends on ARCH_ZYNQMP || COMPILE_TEST
> +	select CRYPTO_AES
> +	select CRYPTO_SKCIPHER
> +	help
> +	  Xilinx ZynqMP has AES-GCM engine used for symmetric key
> +	  encryption and decryption. This driver interfaces with AES hw
> +	  accelerator. Select this if you want to use the ZynqMP module
> +	  for AES algorithms.
> +
>  config CRYPTO_DEV_MEDIATEK
>  	tristate "MediaTek's EIP97 Cryptographic Engine driver"
>  	depends on (ARM && ARCH_MEDIATEK) || COMPILE_TEST
> diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
> index afc4753..c99663a 100644
> --- a/drivers/crypto/Makefile
> +++ b/drivers/crypto/Makefile
> @@ -48,3 +48,4 @@ obj-$(CONFIG_CRYPTO_DEV_BCM_SPU) += bcm/
>  obj-$(CONFIG_CRYPTO_DEV_SAFEXCEL) += inside-secure/
>  obj-$(CONFIG_CRYPTO_DEV_ARTPEC6) += axis/
>  obj-y += hisilicon/
> +obj-$(CONFIG_CRYPTO_DEV_ZYNQMP_AES) += zynqmp-aes-gcm.o
> diff --git a/drivers/crypto/zynqmp-aes-gcm.c b/drivers/crypto/zynqmp-aes-gcm.c
> new file mode 100644
> index 0000000..d65f038
> --- /dev/null
> +++ b/drivers/crypto/zynqmp-aes-gcm.c
> @@ -0,0 +1,297 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Xilinx ZynqMP AES Driver.
> + * Copyright (c) 2019 Xilinx Inc.
> + */
> +
> +#include <crypto/aes.h>
> +#include <crypto/scatterwalk.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/scatterlist.h>
> +#include <linux/firmware/xlnx-zynqmp.h>
> +
> +#define ZYNQMP_AES_IV_SIZE			12
> +#define ZYNQMP_AES_GCM_SIZE			16
> +#define ZYNQMP_AES_KEY_SIZE			32
> +
> +#define ZYNQMP_AES_DECRYPT			0
> +#define ZYNQMP_AES_ENCRYPT			1
> +
> +#define ZYNQMP_AES_KUP_KEY			0
> +#define ZYNQMP_AES_DEVICE_KEY			1
> +#define ZYNQMP_AES_PUF_KEY			2
> +
> +#define ZYNQMP_AES_GCM_TAG_MISMATCH_ERR		0x01
> +#define ZYNQMP_AES_SIZE_ERR			0x06
> +#define ZYNQMP_AES_WRONG_KEY_SRC_ERR		0x13
> +#define ZYNQMP_AES_PUF_NOT_PROGRAMMED		0xE300
> +
> +#define ZYNQMP_AES_BLOCKSIZE			0x04
> +
> +static const struct zynqmp_eemi_ops *eemi_ops;
> +struct zynqmp_aes_dev *aes_dd;

I still think that using a global variable for storing device driver data is bad.

> +
> +struct zynqmp_aes_dev {
> +	struct device *dev;
> +};
> +
> +struct zynqmp_aes_op {
> +	struct zynqmp_aes_dev *dd;
> +	void *src;
> +	void *dst;
> +	int len;
> +	u8 key[ZYNQMP_AES_KEY_SIZE];
> +	u8 *iv;
> +	u32 keylen;
> +	u32 keytype;
> +};
> +
> +struct zynqmp_aes_data {
> +	u64 src;
> +	u64 iv;
> +	u64 key;
> +	u64 dst;
> +	u64 size;
> +	u64 optype;
> +	u64 keysrc;
> +};
> +
> +static int zynqmp_setkey_blk(struct crypto_tfm *tfm, const u8 *key,
> +			     unsigned int len)
> +{
> +	struct zynqmp_aes_op *op = crypto_tfm_ctx(tfm);
> +
> +	if (((len != 1) && (len !=  ZYNQMP_AES_KEY_SIZE)) || (!key))

Typo: two spaces after "!=".

> +		return -EINVAL;
> +
> +	if (len == 1) {
> +		op->keytype = *key;
> +
> +		if ((op->keytype < ZYNQMP_AES_KUP_KEY) ||
> +			(op->keytype > ZYNQMP_AES_PUF_KEY))
> +			return -EINVAL;
> +
> +	} else if (len == ZYNQMP_AES_KEY_SIZE) {
> +		op->keytype = ZYNQMP_AES_KUP_KEY;
> +		op->keylen = len;
> +		memcpy(op->key, key, len);
> +	}
> +
> +	return 0;
> +}

It seems your driver does not support AES key sizes of 128/192 bits; you need a fallback for those cases.
You also need to comment the keylen=1 use case and use a define for this value.
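
Something along these lines is what I have in mind. It is only a userspace
sketch with made-up names (ZYNQMP_KEY_SRC_SEL_KEY_LEN, struct aes_ctx,
sketch_setkey, use_fallback); a real driver would allocate a software
fallback transform and pass the 128/192-bit keys to it instead of setting a
flag.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define AES_KEYSIZE_128	16
#define AES_KEYSIZE_192	24
#define AES_KEYSIZE_256	32

/* Hypothetical name for the 1-byte "select key source" pseudo-key. */
#define ZYNQMP_KEY_SRC_SEL_KEY_LEN	1

enum { KEY_SRC_KUP = 0, KEY_SRC_DEVICE = 1, KEY_SRC_PUF = 2 };

struct aes_ctx {
	uint8_t key[AES_KEYSIZE_256];
	uint32_t keylen;
	uint32_t keysrc;
	int use_fallback;	/* set when the engine cannot take the key */
};

static int sketch_setkey(struct aes_ctx *ctx, const uint8_t *key,
			 unsigned int len)
{
	if (!key)
		return -EINVAL;

	ctx->use_fallback = 0;

	switch (len) {
	case ZYNQMP_KEY_SRC_SEL_KEY_LEN:
		/* One byte selects the key source instead of carrying a key. */
		if (*key > KEY_SRC_PUF)
			return -EINVAL;
		ctx->keysrc = *key;
		return 0;
	case AES_KEYSIZE_256:
		ctx->keysrc = KEY_SRC_KUP;
		break;
	case AES_KEYSIZE_128:
	case AES_KEYSIZE_192:
		/* Not handled by the engine: defer to a software fallback. */
		ctx->use_fallback = 1;
		break;
	default:
		return -EINVAL;
	}

	ctx->keylen = len;
	memcpy(ctx->key, key, len);
	return 0;
}
```

Note that with an unsigned keytype the "< ZYNQMP_AES_KUP_KEY" test in the
patch can never be true, so the sketch only checks the upper bound.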

> +
> +static int zynqmp_aes_xcrypt(struct blkcipher_desc *desc,
> +			     struct scatterlist *dst,
> +			     struct scatterlist *src,
> +			     unsigned int nbytes,
> +			     unsigned int flags)
> +{
> +	struct zynqmp_aes_op *op = crypto_blkcipher_ctx(desc->tfm);
> +	struct zynqmp_aes_dev *dd = aes_dd;
> +	int err, ret, copy_bytes, src_data = 0, dst_data = 0;
> +	dma_addr_t dma_addr, dma_addr_buf;
> +	struct zynqmp_aes_data *abuf;
> +	struct blkcipher_walk walk;
> +	unsigned int data_size;
> +	size_t dma_size;
> +	char *kbuf;
> +
> +	if (!eemi_ops->aes)
> +		return -ENOTSUPP;
> +
> +	if (op->keytype == ZYNQMP_AES_KUP_KEY)
> +		dma_size = nbytes + ZYNQMP_AES_KEY_SIZE
> +			+ ZYNQMP_AES_IV_SIZE;
> +	else
> +		dma_size = nbytes + ZYNQMP_AES_IV_SIZE;
> +
> +	kbuf = dma_alloc_coherent(dd->dev, dma_size, &dma_addr, GFP_KERNEL);
> +	if (!kbuf)
> +		return -ENOMEM;
> +
> +	abuf = dma_alloc_coherent(dd->dev, sizeof(struct zynqmp_aes_data),
> +				  &dma_addr_buf, GFP_KERNEL);
> +	if (!abuf) {
> +		dma_free_coherent(dd->dev, dma_size, kbuf, dma_addr);
> +		return -ENOMEM;
> +	}
> +
> +	data_size = nbytes;
> +	blkcipher_walk_init(&walk, dst, src, data_size);
> +	err = blkcipher_walk_virt(desc, &walk);
> +	op->iv = walk.iv;
> +
> +	while ((nbytes = walk.nbytes)) {
> +		op->src = walk.src.virt.addr;
> +		memcpy(kbuf + src_data, op->src, nbytes);
> +		src_data = src_data + nbytes;
> +		nbytes &= (ZYNQMP_AES_BLOCKSIZE - 1);
> +		err = blkcipher_walk_done(desc, &walk, nbytes);
> +	}
> +	memcpy(kbuf + data_size, op->iv, ZYNQMP_AES_IV_SIZE);
> +	abuf->src = dma_addr;
> +	abuf->dst = dma_addr;
> +	abuf->iv = abuf->src + data_size;
> +	abuf->size = data_size - ZYNQMP_AES_GCM_SIZE;
> +	abuf->optype = flags;
> +	abuf->keysrc = op->keytype;
> +
> +	if (op->keytype == ZYNQMP_AES_KUP_KEY) {
> +		memcpy(kbuf + data_size + ZYNQMP_AES_IV_SIZE,
> +		       op->key, ZYNQMP_AES_KEY_SIZE);
> +
> +		abuf->key = abuf->src + data_size + ZYNQMP_AES_IV_SIZE;
> +	} else {
> +		abuf->key = 0;
> +	}
> +	eemi_ops->aes(dma_addr_buf, &ret);
> +
> +	if (ret != 0) {
> +		switch (ret) {
> +		case ZYNQMP_AES_GCM_TAG_MISMATCH_ERR:
> +			dev_err(dd->dev, "ERROR: Gcm Tag mismatch\n\r");
> +			break;
> +		case ZYNQMP_AES_SIZE_ERR:
> +			dev_err(dd->dev, "ERROR : Non word aligned data\n\r");
> +			break;
> +		case ZYNQMP_AES_WRONG_KEY_SRC_ERR:
> +			dev_err(dd->dev, "ERROR: Wrong KeySrc, enable secure mode\n\r");
> +			break;
> +		case ZYNQMP_AES_PUF_NOT_PROGRAMMED:
> +			dev_err(dd->dev, "ERROR: PUF is not registered\r\n");
> +			break;
> +		default:
> +			dev_err(dd->dev, "ERROR: Invalid");
> +			break;
> +		}
> +		goto END;
> +	}
> +	if (flags)
> +		copy_bytes = data_size;
> +	else
> +		copy_bytes = data_size - ZYNQMP_AES_GCM_SIZE;
> +
> +	blkcipher_walk_init(&walk, dst, src, copy_bytes);
> +	err = blkcipher_walk_virt(desc, &walk);
> +
> +	while ((nbytes = walk.nbytes)) {
> +		memcpy(walk.dst.virt.addr, kbuf + dst_data, nbytes);
> +		dst_data = dst_data + nbytes;
> +		nbytes &= (ZYNQMP_AES_BLOCKSIZE - 1);
> +		err = blkcipher_walk_done(desc, &walk, nbytes);
> +	}
> +END:
> +	memset(kbuf, 0, dma_size);
> +	memset(abuf, 0, sizeof(struct zynqmp_aes_data));
> +	dma_free_coherent(dd->dev, dma_size, kbuf, dma_addr);
> +	dma_free_coherent(dd->dev, sizeof(struct zynqmp_aes_data),
> +			  abuf, dma_addr_buf);
> +	return err;
> +}
> +
> +static int zynqmp_aes_decrypt(struct blkcipher_desc *desc,
> +			      struct scatterlist *dst,
> +			      struct scatterlist *src,
> +			      unsigned int nbytes)
> +{
> +	return zynqmp_aes_xcrypt(desc, dst, src, nbytes, ZYNQMP_AES_DECRYPT);
> +}
> +
> +static int zynqmp_aes_encrypt(struct blkcipher_desc *desc,
> +			      struct scatterlist *dst,
> +			      struct scatterlist *src,
> +			      unsigned int nbytes)
> +{
> +	return zynqmp_aes_xcrypt(desc, dst, src, nbytes, ZYNQMP_AES_ENCRYPT);
> +}
> +
> +static struct crypto_alg zynqmp_alg = {
> +	.cra_name		=	"xilinx-zynqmp-aes",
> +	.cra_driver_name	=	"zynqmp-aes-gcm",
> +	.cra_priority		=	400,
> +	.cra_flags		=	CRYPTO_ALG_TYPE_BLKCIPHER |
> +					CRYPTO_ALG_KERN_DRIVER_ONLY,
> +	.cra_blocksize		=	ZYNQMP_AES_BLOCKSIZE,
> +	.cra_ctxsize		=	sizeof(struct zynqmp_aes_op),
> +	.cra_alignmask		=	15,
> +	.cra_type		=	&crypto_blkcipher_type,
> +	.cra_module		=	THIS_MODULE,
> +	.cra_u			=	{
> +	.blkcipher	=	{
> +			.min_keysize	=	0,

Are you sure you want to accept a keysize of 0?

Regards


* [pull request][net-next 00/18] Mellanox, mlx5 software managed steering
From: Saeed Mahameed @ 2019-09-02  7:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Alex Vesker, Erez Shitrit, Saeed Mahameed

Hi Dave,

This series adds the support for software (driver managed) flow steering.
For more information please see tag log below.

Please pull and let me know if there is any problem.

Please note that the series starts with a merge of mlx5-next branch,
to resolve and avoid dependency with rdma tree.

Thanks,
Saeed.

---
The following changes since commit a06ebb8d953b4100236f3057be51d67640e06323:

  Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (2019-09-02 00:16:05 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2019-09-01

for you to fetch changes up to 6208aecc121dde491047008f8dbc1e734c8e634b:

  net/mlx5: Add devlink flow_steering_mode parameter (2019-09-02 00:16:14 -0700)

----------------------------------------------------------------
mlx5-updates-2019-09-01  (Software steering support)

Abstract:
--------
Mellanox ConnectX devices support packet matching, packet modification and
redirection. These functionalities are also referred to as flow-steering.
To configure a steering rule, the rule is written to the device owned
memory; this memory is accessed and cached by the device when processing
a packet.
Steering rules are constructed from multiple steering entries (STE).

Rules are configured using the Firmware command interface. The Firmware
processes the given driver command and translates them to STEs, then
writes them to the device memory in the current steering tables.
This process is slow due to the architecture of the command interface and
the processing complexity of each rule.

The highlight of this patchset is to cut out the middleman (the firmware) and
program steering rules into the device directly from the driver, with
no firmware intervention whatsoever.

Motivation:
-----------
Software (driver managed) steering allows for high rule insertion rates
compared to the FW steering described above. This is achieved by using
internal RDMA writes to the device owned memory instead of the slow
command interface to program steering rules.

Software (driver managed) steering doesn't depend on new FW
for new steering functionality; new implementations can be done in the
driver, skipping the FW layer.

Performance:
------------
The insertion rate on a single core using the new approach allows
programming ~300K rules per sec.

Test: TC L2 rules
33K/s with Software steering (this patchset).
5K/s  with FW and current driver.
This will improve OVS based solution performance.

Architecture and implementation details:
----------------------------------------
Software steering will be dynamically selected via devlink device
parameter. Example:
$ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
          pci/0000:06:00.0:
          name flow_steering_mode type driver-specific
          values:
             cmode runtime value smfs

The mlx5 software steering module, a.k.a. DR (Direct Rule), is implemented
in the mlx5/core/steering directory and controlled by the
MLX5_SW_STEERING kconfig flag.

mlx5 core steering layer (fs_core) already provides a shim layer for
implementing different steering mechanisms, software steering will
leverage that as seen at the end of this series.

When Software Steering for a specific steering domain
(NIC/RDMA/Vport/ESwitch, etc.) is supported, it will cause rules
targeting this domain to be created using SW steering instead of FW.

The implementation includes:
Domain - The steering domain is the object that all other objects reside
    in. It holds the memory allocator, send engine, locks and other shared
    data needed by lower objects such as table, matcher, rule, action.
    Each domain can contain multiple tables. Domain is equivalent to
    namespaces e.g (NIC/RDMA/Vport/ESwitch, etc ..) as implemented
    currently in mlx5_core fs_core (flow steering core).

Table - Table objects are used for holding multiple matchers, each table
    has a level used to prevent processing loops. Packets are being
    directed to this table once it is set as the root table, this is done
    by fs_core using a FW command. A packet is being processed inside the
    table matcher by matcher until a successful hit, otherwise the packet
    will perform the default action.

Matcher - Matcher objects are used to specify the field mask for
    matching when processing a packet. A matcher belongs to a table, each
    matcher can hold multiple rules, each rule with different matching
    values corresponding to the matcher mask. Each matcher has a priority
    used for rule processing order inside the table.

Action - Action objects are created to specify different steering actions
    such as count, reformat (encapsulate, decapsulate, ...), modify
    header, forward to table and many other actions. When creating a rule
    a sequence of actions can be provided to be executed on a successful
    match.

Rule - Rule objects are used to specify a specific match on packets as
    well as the actions that should be executed. A rule belongs to a
    matcher.

STE - This layer is used to hold the specific STE format for the device
    and to convert the requested rule to STEs. Each rule is constructed of
    an STE chain, and multiple rules construct a steering graph. Each node in
    the graph is a hash table containing multiple STEs. The index of each
    STE in the hash table is being calculated using a CRC32 hash function.

Memory pool - Used for managing and caching device owned memory for rule
    insertion. The memory is being allocated using DM (device memory) API.

Communication with device - Layer for standard RDMA operations using an RC QP
    to configure the device steering.

Command utility - This module holds all of the FW commands that are
    required for SW steering to function.
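
As a rough mental model of the table/matcher/rule relationship described
above, consider the following sketch. It is not the mlx5dr code: all type
and function names are invented, a single 32-bit word stands in for the
packet headers, and a linear scan replaces the STE hash tables the device
actually walks.

```c
#include <stddef.h>
#include <stdint.h>

enum action { ACT_DEFAULT = 0, ACT_DROP, ACT_FORWARD };

struct rule {			/* belongs to one matcher */
	uint32_t value;		/* expected value of the masked fields */
	enum action act;	/* executed on a successful match */
};

struct matcher {		/* belongs to one table */
	uint32_t mask;		/* which packet bits take part in matching */
	int priority;		/* processing order inside the table */
	const struct rule *rules;
	size_t nrules;
};

struct table {
	const struct matcher *matchers;	/* assumed sorted by priority */
	size_t nmatchers;
};

/* Walk the table matcher by matcher until a rule hits, else default. */
static enum action table_process(const struct table *t, uint32_t pkt_fields)
{
	for (size_t m = 0; m < t->nmatchers; m++) {
		const struct matcher *mt = &t->matchers[m];

		for (size_t r = 0; r < mt->nrules; r++) {
			const struct rule *rl = &mt->rules[r];

			if ((pkt_fields & mt->mask) == (rl->value & mt->mask))
				return rl->act;
		}
	}
	return ACT_DEFAULT;
}
```

A domain would own several such tables together with the shared allocator
and send engine; those parts are left out to keep the sketch short.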

Patch planning and files:
-------------------------
1) The first patch adds support for flow steering actions to the fs_cmd
shim layer.

2) The next 12 patches add one file per software steering
functionality/module as described above. (See patches with title: DR, *)

3) Add CONFIG_MLX5_SW_STEERING for software steering support and enable
build with the new files

4) Next two patches will add the support for software steering in mlx5
steering shim layer
net/mlx5: Add API to set the namespace steering mode
net/mlx5: Add direct rule fs_cmd implementation

5) Last two patches will add the new devlink parameter to select mlx5
steering mode, will be valid only for switchdev mode for now.
Two modes are supported:
    1. DMFS - Device managed flow steering
    2. SMFS - Software/Driver managed flow steering.

    In the DMFS mode, the HW steering entities are created through the
    FW. In the SMFS mode these entities are created through the driver
    directly.

    The driver will use the devlink steering mode only if the steering
    domain supports it; for now SMFS will manage only the switchdev
    eswitch steering domain.

    User command examples:
    - Set SMFS flow steering mode::

        $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime

    - Read device flow steering mode::

        $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
          pci/0000:06:00.0:
          name flow_steering_mode type driver-specific
          values:
             cmode runtime value smfs

----------------------------------------------------------------
Alex Vesker (13):
      net/mlx5: DR, Add the internal direct rule types definitions
      net/mlx5: DR, Add direct rule command utilities
      net/mlx5: DR, ICM pool memory allocator
      net/mlx5: DR, Expose an internal API to issue RDMA operations
      net/mlx5: DR, Add Steering entry (STE) utilities
      net/mlx5: DR, Expose steering domain functionality
      net/mlx5: DR, Expose steering table functionality
      net/mlx5: DR, Expose steering matcher functionality
      net/mlx5: DR, Expose steering action functionality
      net/mlx5: DR, Expose steering rule functionality
      net/mlx5: DR, Add required FW steering functionality
      net/mlx5: DR, Expose APIs for direct rule managing
      net/mlx5: DR, Add CONFIG_MLX5_SW_STEERING for software steering support

Maor Gottlieb (5):
      net/mlx5: Add flow steering actions to fs_cmd shim layer
      net/mlx5: Add direct rule fs_cmd implementation
      net/mlx5: Add API to set the namespace steering mode
      net/mlx5: Add support to use SMFS in switchdev mode
      net/mlx5: Add devlink flow_steering_mode parameter

 .../networking/device_drivers/mellanox/mlx5.rst    |   33 +
 drivers/infiniband/hw/mlx5/flow.c                  |   18 +-
 drivers/infiniband/hw/mlx5/main.c                  |    7 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h               |    5 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |    7 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |    7 +
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  |  112 +-
 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.c    |   27 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.h   |    2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |   46 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |    7 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   87 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |  116 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h   |   25 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  160 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |   39 +-
 .../ethernet/mellanox/mlx5/core/steering/Makefile  |    2 +
 .../mellanox/mlx5/core/steering/dr_action.c        | 1588 ++++++++++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_cmd.c  |  480 ++++
 .../mellanox/mlx5/core/steering/dr_crc32.c         |   98 +
 .../mellanox/mlx5/core/steering/dr_domain.c        |  395 ++++
 .../ethernet/mellanox/mlx5/core/steering/dr_fw.c   |   93 +
 .../mellanox/mlx5/core/steering/dr_icm_pool.c      |  570 +++++
 .../mellanox/mlx5/core/steering/dr_matcher.c       |  770 +++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_rule.c | 1243 +++++++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_send.c |  976 +++++++++
 .../ethernet/mellanox/mlx5/core/steering/dr_ste.c  | 2308 ++++++++++++++++++++
 .../mellanox/mlx5/core/steering/dr_table.c         |  294 +++
 .../mellanox/mlx5/core/steering/dr_types.h         | 1060 +++++++++
 .../ethernet/mellanox/mlx5/core/steering/fs_dr.c   |  600 +++++
 .../ethernet/mellanox/mlx5/core/steering/fs_dr.h   |   60 +
 .../mellanox/mlx5/core/steering/mlx5_ifc_dr.h      |  604 +++++
 .../ethernet/mellanox/mlx5/core/steering/mlx5dr.h  |  212 ++
 include/linux/mlx5/fs.h                            |   33 +-
 34 files changed, 11964 insertions(+), 120 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/Makefile
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_crc32.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_domain.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_fw.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_matcher.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_rule.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_ste.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_types.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/fs_dr.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/fs_dr.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5_ifc_dr.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5dr.h


* [net-next 03/18] net/mlx5: DR, Add direct rule command utilities
From: Saeed Mahameed @ 2019-09-02  7:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Alex Vesker, Erez Shitrit, Mark Bloch,
	Saeed Mahameed
In-Reply-To: <20190902072213.7683-1-saeedm@mellanox.com>

From: Alex Vesker <valex@mellanox.com>

Add direct rule command utilities, which consist of all the FW
commands that are executed to provide the SW steering functionality.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/steering/dr_cmd.c      | 480 ++++++++++++++
 .../mellanox/mlx5/core/steering/mlx5_ifc_dr.h | 604 ++++++++++++++++++
 2 files changed, 1084 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5_ifc_dr.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_cmd.c
new file mode 100644
index 000000000000..41662c4e2664
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_cmd.c
@@ -0,0 +1,480 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "dr_types.h"
+
+int mlx5dr_cmd_query_esw_vport_context(struct mlx5_core_dev *mdev,
+				       bool other_vport,
+				       u16 vport_number,
+				       u64 *icm_address_rx,
+				       u64 *icm_address_tx)
+{
+	u32 out[MLX5_ST_SZ_DW(query_esw_vport_context_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(query_esw_vport_context_in)] = {};
+	int err;
+
+	MLX5_SET(query_esw_vport_context_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_ESW_VPORT_CONTEXT);
+	MLX5_SET(query_esw_vport_context_in, in, other_vport, other_vport);
+	MLX5_SET(query_esw_vport_context_in, in, vport_number, vport_number);
+
+	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+	if (err)
+		return err;
+
+	*icm_address_rx =
+		MLX5_GET64(query_esw_vport_context_out, out,
+			   esw_vport_context.sw_steering_vport_icm_address_rx);
+	*icm_address_tx =
+		MLX5_GET64(query_esw_vport_context_out, out,
+			   esw_vport_context.sw_steering_vport_icm_address_tx);
+	return 0;
+}
+
+int mlx5dr_cmd_query_gvmi(struct mlx5_core_dev *mdev, bool other_vport,
+			  u16 vport_number, u16 *gvmi)
+{
+	u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
+	int out_size;
+	void *out;
+	int err;
+
+	out_size = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	out = kzalloc(out_size, GFP_KERNEL);
+	if (!out)
+		return -ENOMEM;
+
+	MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+	MLX5_SET(query_hca_cap_in, in, other_function, other_vport);
+	MLX5_SET(query_hca_cap_in, in, function_id, vport_number);
+	MLX5_SET(query_hca_cap_in, in, op_mod,
+		 MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1 |
+		 HCA_CAP_OPMOD_GET_CUR);
+
+	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, out_size);
+	if (err) {
+		kfree(out);
+		return err;
+	}
+
+	*gvmi = MLX5_GET(query_hca_cap_out, out, capability.cmd_hca_cap.vhca_id);
+
+	kfree(out);
+	return 0;
+}
+
+int mlx5dr_cmd_query_esw_caps(struct mlx5_core_dev *mdev,
+			      struct mlx5dr_esw_caps *caps)
+{
+	caps->drop_icm_address_rx =
+		MLX5_CAP64_ESW_FLOWTABLE(mdev,
+					 sw_steering_fdb_action_drop_icm_address_rx);
+	caps->drop_icm_address_tx =
+		MLX5_CAP64_ESW_FLOWTABLE(mdev,
+					 sw_steering_fdb_action_drop_icm_address_tx);
+	caps->uplink_icm_address_rx =
+		MLX5_CAP64_ESW_FLOWTABLE(mdev,
+					 sw_steering_uplink_icm_address_rx);
+	caps->uplink_icm_address_tx =
+		MLX5_CAP64_ESW_FLOWTABLE(mdev,
+					 sw_steering_uplink_icm_address_tx);
+	caps->sw_owner =
+		MLX5_CAP_ESW_FLOWTABLE_FDB(mdev,
+					   sw_owner);
+
+	return 0;
+}
+
+int mlx5dr_cmd_query_device(struct mlx5_core_dev *mdev,
+			    struct mlx5dr_cmd_caps *caps)
+{
+	caps->prio_tag_required	= MLX5_CAP_GEN(mdev, prio_tag_required);
+	caps->eswitch_manager	= MLX5_CAP_GEN(mdev, eswitch_manager);
+	caps->gvmi		= MLX5_CAP_GEN(mdev, vhca_id);
+	caps->flex_protocols	= MLX5_CAP_GEN(mdev, flex_parser_protocols);
+
+	if (mlx5dr_matcher_supp_flex_parser_icmp_v4(caps)) {
+		caps->flex_parser_id_icmp_dw0 = MLX5_CAP_GEN(mdev, flex_parser_id_icmp_dw0);
+		caps->flex_parser_id_icmp_dw1 = MLX5_CAP_GEN(mdev, flex_parser_id_icmp_dw1);
+	}
+
+	if (mlx5dr_matcher_supp_flex_parser_icmp_v6(caps)) {
+		caps->flex_parser_id_icmpv6_dw0 =
+			MLX5_CAP_GEN(mdev, flex_parser_id_icmpv6_dw0);
+		caps->flex_parser_id_icmpv6_dw1 =
+			MLX5_CAP_GEN(mdev, flex_parser_id_icmpv6_dw1);
+	}
+
+	caps->nic_rx_drop_address =
+		MLX5_CAP64_FLOWTABLE(mdev, sw_steering_nic_rx_action_drop_icm_address);
+	caps->nic_tx_drop_address =
+		MLX5_CAP64_FLOWTABLE(mdev, sw_steering_nic_tx_action_drop_icm_address);
+	caps->nic_tx_allow_address =
+		MLX5_CAP64_FLOWTABLE(mdev, sw_steering_nic_tx_action_allow_icm_address);
+
+	caps->rx_sw_owner = MLX5_CAP_FLOWTABLE_NIC_RX(mdev, sw_owner);
+	caps->max_ft_level = MLX5_CAP_FLOWTABLE_NIC_RX(mdev, max_ft_level);
+
+	caps->tx_sw_owner = MLX5_CAP_FLOWTABLE_NIC_TX(mdev, sw_owner);
+
+	caps->log_icm_size = MLX5_CAP_DEV_MEM(mdev, log_steering_sw_icm_size);
+	caps->hdr_modify_icm_addr =
+		MLX5_CAP64_DEV_MEM(mdev, header_modify_sw_icm_start_address);
+
+	caps->roce_min_src_udp = MLX5_CAP_ROCE(mdev, r_roce_min_src_udp_port);
+
+	return 0;
+}
+
+int mlx5dr_cmd_query_flow_table(struct mlx5_core_dev *dev,
+				enum fs_flow_table_type type,
+				u32 table_id,
+				struct mlx5dr_cmd_query_flow_table_details *output)
+{
+	u32 out[MLX5_ST_SZ_DW(query_flow_table_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(query_flow_table_in)] = {};
+	int err;
+
+	MLX5_SET(query_flow_table_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_FLOW_TABLE);
+
+	MLX5_SET(query_flow_table_in, in, table_type, type);
+	MLX5_SET(query_flow_table_in, in, table_id, table_id);
+
+	err = mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+	if (err)
+		return err;
+
+	output->status = MLX5_GET(query_flow_table_out, out, status);
+	output->level = MLX5_GET(query_flow_table_out, out, flow_table_context.level);
+
+	output->sw_owner_icm_root_1 = MLX5_GET64(query_flow_table_out, out,
+						 flow_table_context.sw_owner_icm_root_1);
+	output->sw_owner_icm_root_0 = MLX5_GET64(query_flow_table_out, out,
+						 flow_table_context.sw_owner_icm_root_0);
+
+	return 0;
+}
+
+int mlx5dr_cmd_sync_steering(struct mlx5_core_dev *mdev)
+{
+	u32 out[MLX5_ST_SZ_DW(sync_steering_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(sync_steering_in)] = {};
+
+	MLX5_SET(sync_steering_in, in, opcode, MLX5_CMD_OP_SYNC_STEERING);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5dr_cmd_set_fte_modify_and_vport(struct mlx5_core_dev *mdev,
+					u32 table_type,
+					u32 table_id,
+					u32 group_id,
+					u32 modify_header_id,
+					u32 vport_id)
+{
+	u32 out[MLX5_ST_SZ_DW(set_fte_out)] = {};
+	void *in_flow_context;
+	unsigned int inlen;
+	void *in_dests;
+	u32 *in;
+	int err;
+
+	inlen = MLX5_ST_SZ_BYTES(set_fte_in) +
+		1 * MLX5_ST_SZ_BYTES(dest_format_struct); /* One destination only */
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(set_fte_in, in, opcode, MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY);
+	MLX5_SET(set_fte_in, in, table_type, table_type);
+	MLX5_SET(set_fte_in, in, table_id, table_id);
+
+	in_flow_context = MLX5_ADDR_OF(set_fte_in, in, flow_context);
+	MLX5_SET(flow_context, in_flow_context, group_id, group_id);
+	MLX5_SET(flow_context, in_flow_context, modify_header_id, modify_header_id);
+	MLX5_SET(flow_context, in_flow_context, destination_list_size, 1);
+	MLX5_SET(flow_context, in_flow_context, action,
+		 MLX5_FLOW_CONTEXT_ACTION_FWD_DEST |
+		 MLX5_FLOW_CONTEXT_ACTION_MOD_HDR);
+
+	in_dests = MLX5_ADDR_OF(flow_context, in_flow_context, destination);
+	MLX5_SET(dest_format_struct, in_dests, destination_type,
+		 MLX5_FLOW_DESTINATION_TYPE_VPORT);
+	MLX5_SET(dest_format_struct, in_dests, destination_id, vport_id);
+
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	kvfree(in);
+
+	return err;
+}
+
+int mlx5dr_cmd_del_flow_table_entry(struct mlx5_core_dev *mdev,
+				    u32 table_type,
+				    u32 table_id)
+{
+	u32 out[MLX5_ST_SZ_DW(delete_fte_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(delete_fte_in)] = {};
+
+	MLX5_SET(delete_fte_in, in, opcode, MLX5_CMD_OP_DELETE_FLOW_TABLE_ENTRY);
+	MLX5_SET(delete_fte_in, in, table_type, table_type);
+	MLX5_SET(delete_fte_in, in, table_id, table_id);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5dr_cmd_alloc_modify_header(struct mlx5_core_dev *mdev,
+				   u32 table_type,
+				   u8 num_of_actions,
+				   u64 *actions,
+				   u32 *modify_header_id)
+{
+	u32 out[MLX5_ST_SZ_DW(alloc_modify_header_context_out)] = {};
+	void *p_actions;
+	u32 inlen;
+	u32 *in;
+	int err;
+
+	inlen = MLX5_ST_SZ_BYTES(alloc_modify_header_context_in) +
+		 num_of_actions * sizeof(u64);
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(alloc_modify_header_context_in, in, opcode,
+		 MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT);
+	MLX5_SET(alloc_modify_header_context_in, in, table_type, table_type);
+	MLX5_SET(alloc_modify_header_context_in, in, num_of_actions, num_of_actions);
+	p_actions = MLX5_ADDR_OF(alloc_modify_header_context_in, in, actions);
+	memcpy(p_actions, actions, num_of_actions * sizeof(u64));
+
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	if (err)
+		goto out;
+
+	*modify_header_id = MLX5_GET(alloc_modify_header_context_out, out,
+				     modify_header_id);
+out:
+	kvfree(in);
+	return err;
+}
+
+int mlx5dr_cmd_dealloc_modify_header(struct mlx5_core_dev *mdev,
+				     u32 modify_header_id)
+{
+	u32 out[MLX5_ST_SZ_DW(dealloc_modify_header_context_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(dealloc_modify_header_context_in)] = {};
+
+	MLX5_SET(dealloc_modify_header_context_in, in, opcode,
+		 MLX5_CMD_OP_DEALLOC_MODIFY_HEADER_CONTEXT);
+	MLX5_SET(dealloc_modify_header_context_in, in, modify_header_id,
+		 modify_header_id);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5dr_cmd_create_empty_flow_group(struct mlx5_core_dev *mdev,
+				       u32 table_type,
+				       u32 table_id,
+				       u32 *group_id)
+{
+	u32 out[MLX5_ST_SZ_DW(create_flow_group_out)] = {};
+	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	u32 *in;
+	int err;
+
+	in = kzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(create_flow_group_in, in, opcode, MLX5_CMD_OP_CREATE_FLOW_GROUP);
+	MLX5_SET(create_flow_group_in, in, table_type, table_type);
+	MLX5_SET(create_flow_group_in, in, table_id, table_id);
+
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	if (err)
+		goto out;
+
+	*group_id = MLX5_GET(create_flow_group_out, out, group_id);
+
+out:
+	kfree(in);
+	return err;
+}
+
+int mlx5dr_cmd_destroy_flow_group(struct mlx5_core_dev *mdev,
+				  u32 table_type,
+				  u32 table_id,
+				  u32 group_id)
+{
+	u32 in[MLX5_ST_SZ_DW(destroy_flow_group_in)] = {};
+	u32 out[MLX5_ST_SZ_DW(destroy_flow_group_out)] = {};
+
+	MLX5_SET(destroy_flow_group_in, in, opcode, MLX5_CMD_OP_DESTROY_FLOW_GROUP);
+	MLX5_SET(destroy_flow_group_in, in, table_type, table_type);
+	MLX5_SET(destroy_flow_group_in, in, table_id, table_id);
+	MLX5_SET(destroy_flow_group_in, in, group_id, group_id);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5dr_cmd_create_flow_table(struct mlx5_core_dev *mdev,
+				 u32 table_type,
+				 u64 icm_addr_rx,
+				 u64 icm_addr_tx,
+				 u8 level,
+				 bool sw_owner,
+				 bool term_tbl,
+				 u64 *fdb_rx_icm_addr,
+				 u32 *table_id)
+{
+	u32 out[MLX5_ST_SZ_DW(create_flow_table_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(create_flow_table_in)] = {};
+	void *ft_mdev;
+	int err;
+
+	MLX5_SET(create_flow_table_in, in, opcode, MLX5_CMD_OP_CREATE_FLOW_TABLE);
+	MLX5_SET(create_flow_table_in, in, table_type, table_type);
+
+	ft_mdev = MLX5_ADDR_OF(create_flow_table_in, in, flow_table_context);
+	MLX5_SET(flow_table_context, ft_mdev, termination_table, term_tbl);
+	MLX5_SET(flow_table_context, ft_mdev, sw_owner, sw_owner);
+	MLX5_SET(flow_table_context, ft_mdev, level, level);
+
+	if (sw_owner) {
+		/* icm_addr_0 used for FDB RX / NIC TX / NIC_RX
+		 * icm_addr_1 used for FDB TX
+		 */
+		if (table_type == MLX5_FLOW_TABLE_TYPE_NIC_RX) {
+			MLX5_SET64(flow_table_context, ft_mdev,
+				   sw_owner_icm_root_0, icm_addr_rx);
+		} else if (table_type == MLX5_FLOW_TABLE_TYPE_NIC_TX) {
+			MLX5_SET64(flow_table_context, ft_mdev,
+				   sw_owner_icm_root_0, icm_addr_tx);
+		} else if (table_type == MLX5_FLOW_TABLE_TYPE_FDB) {
+			MLX5_SET64(flow_table_context, ft_mdev,
+				   sw_owner_icm_root_0, icm_addr_rx);
+			MLX5_SET64(flow_table_context, ft_mdev,
+				   sw_owner_icm_root_1, icm_addr_tx);
+		}
+	}
+
+	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+	if (err)
+		return err;
+
+	*table_id = MLX5_GET(create_flow_table_out, out, table_id);
+	if (!sw_owner && table_type == MLX5_FLOW_TABLE_TYPE_FDB)
+		*fdb_rx_icm_addr =
+		(u64)MLX5_GET(create_flow_table_out, out, icm_address_31_0) |
+		(u64)MLX5_GET(create_flow_table_out, out, icm_address_39_32) << 32 |
+		(u64)MLX5_GET(create_flow_table_out, out, icm_address_63_40) << 40;
+
+	return 0;
+}
+
+int mlx5dr_cmd_destroy_flow_table(struct mlx5_core_dev *mdev,
+				  u32 table_id,
+				  u32 table_type)
+{
+	u32 out[MLX5_ST_SZ_DW(destroy_flow_table_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(destroy_flow_table_in)] = {};
+
+	MLX5_SET(destroy_flow_table_in, in, opcode,
+		 MLX5_CMD_OP_DESTROY_FLOW_TABLE);
+	MLX5_SET(destroy_flow_table_in, in, table_type, table_type);
+	MLX5_SET(destroy_flow_table_in, in, table_id, table_id);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5dr_cmd_create_reformat_ctx(struct mlx5_core_dev *mdev,
+				   enum mlx5_reformat_ctx_type rt,
+				   size_t reformat_size,
+				   void *reformat_data,
+				   u32 *reformat_id)
+{
+	u32 out[MLX5_ST_SZ_DW(alloc_packet_reformat_context_out)] = {};
+	size_t inlen, cmd_data_sz, cmd_total_sz;
+	void *prctx;
+	void *pdata;
+	void *in;
+	int err;
+
+	cmd_total_sz = MLX5_ST_SZ_BYTES(alloc_packet_reformat_context_in);
+	cmd_data_sz = MLX5_FLD_SZ_BYTES(alloc_packet_reformat_context_in,
+					packet_reformat_context.reformat_data);
+	inlen = ALIGN(cmd_total_sz + reformat_size - cmd_data_sz, 4);
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(alloc_packet_reformat_context_in, in, opcode,
+		 MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT);
+
+	prctx = MLX5_ADDR_OF(alloc_packet_reformat_context_in, in, packet_reformat_context);
+	pdata = MLX5_ADDR_OF(packet_reformat_context_in, prctx, reformat_data);
+
+	MLX5_SET(packet_reformat_context_in, prctx, reformat_type, rt);
+	MLX5_SET(packet_reformat_context_in, prctx, reformat_data_size, reformat_size);
+	memcpy(pdata, reformat_data, reformat_size);
+
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	if (err)
+		goto out;
+
+	*reformat_id = MLX5_GET(alloc_packet_reformat_context_out, out, packet_reformat_id);
+out:
+	kvfree(in);
+	return err;
+}
+
+void mlx5dr_cmd_destroy_reformat_ctx(struct mlx5_core_dev *mdev,
+				     u32 reformat_id)
+{
+	u32 out[MLX5_ST_SZ_DW(dealloc_packet_reformat_context_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(dealloc_packet_reformat_context_in)] = {};
+
+	MLX5_SET(dealloc_packet_reformat_context_in, in, opcode,
+		 MLX5_CMD_OP_DEALLOC_PACKET_REFORMAT_CONTEXT);
+	MLX5_SET(dealloc_packet_reformat_context_in, in, packet_reformat_id,
+		 reformat_id);
+
+	mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5dr_cmd_query_gid(struct mlx5_core_dev *mdev, u8 vhca_port_num,
+			 u16 index, struct mlx5dr_cmd_gid_attr *attr)
+{
+	u32 out[MLX5_ST_SZ_DW(query_roce_address_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(query_roce_address_in)] = {};
+	int err;
+
+	MLX5_SET(query_roce_address_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_ROCE_ADDRESS);
+
+	MLX5_SET(query_roce_address_in, in, roce_address_index, index);
+	MLX5_SET(query_roce_address_in, in, vhca_port_num, vhca_port_num);
+
+	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+	if (err)
+		return err;
+
+	memcpy(&attr->gid,
+	       MLX5_ADDR_OF(query_roce_address_out,
+			    out, roce_address.source_l3_address),
+	       sizeof(attr->gid));
+	memcpy(attr->mac,
+	       MLX5_ADDR_OF(query_roce_address_out, out,
+			    roce_address.source_mac_47_32),
+	       sizeof(attr->mac));
+
+	if (MLX5_GET(query_roce_address_out, out,
+		     roce_address.roce_version) == MLX5_ROCE_VERSION_2)
+		attr->roce_ver = MLX5_ROCE_VERSION_2;
+	else
+		attr->roce_ver = MLX5_ROCE_VERSION_1;
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5_ifc_dr.h b/drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5_ifc_dr.h
new file mode 100644
index 000000000000..596c927220d9
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/mlx5_ifc_dr.h
@@ -0,0 +1,604 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019, Mellanox Technologies */
+
+#ifndef MLX5_IFC_DR_H
+#define MLX5_IFC_DR_H
+
+enum {
+	MLX5DR_ACTION_MDFY_HW_FLD_L2_0		= 0,
+	MLX5DR_ACTION_MDFY_HW_FLD_L2_1		= 1,
+	MLX5DR_ACTION_MDFY_HW_FLD_L2_2		= 2,
+	MLX5DR_ACTION_MDFY_HW_FLD_L3_0		= 3,
+	MLX5DR_ACTION_MDFY_HW_FLD_L3_1		= 4,
+	MLX5DR_ACTION_MDFY_HW_FLD_L3_2		= 5,
+	MLX5DR_ACTION_MDFY_HW_FLD_L3_3		= 6,
+	MLX5DR_ACTION_MDFY_HW_FLD_L3_4		= 7,
+	MLX5DR_ACTION_MDFY_HW_FLD_L4_0		= 8,
+	MLX5DR_ACTION_MDFY_HW_FLD_L4_1		= 9,
+	MLX5DR_ACTION_MDFY_HW_FLD_MPLS		= 10,
+	MLX5DR_ACTION_MDFY_HW_FLD_L2_TNL_0	= 11,
+	MLX5DR_ACTION_MDFY_HW_FLD_REG_0		= 12,
+	MLX5DR_ACTION_MDFY_HW_FLD_REG_1		= 13,
+	MLX5DR_ACTION_MDFY_HW_FLD_REG_2		= 14,
+	MLX5DR_ACTION_MDFY_HW_FLD_REG_3		= 15,
+	MLX5DR_ACTION_MDFY_HW_FLD_L4_2		= 16,
+	MLX5DR_ACTION_MDFY_HW_FLD_FLEX_0	= 17,
+	MLX5DR_ACTION_MDFY_HW_FLD_FLEX_1	= 18,
+	MLX5DR_ACTION_MDFY_HW_FLD_FLEX_2	= 19,
+	MLX5DR_ACTION_MDFY_HW_FLD_FLEX_3	= 20,
+	MLX5DR_ACTION_MDFY_HW_FLD_L2_TNL_1	= 21,
+	MLX5DR_ACTION_MDFY_HW_FLD_METADATA	= 22,
+	MLX5DR_ACTION_MDFY_HW_FLD_RESERVED	= 23,
+};
+
+enum {
+	MLX5DR_ACTION_MDFY_HW_OP_SET		= 0x2,
+	MLX5DR_ACTION_MDFY_HW_OP_ADD		= 0x3,
+};
+
+enum {
+	MLX5DR_ACTION_MDFY_HW_HDR_L3_NONE	= 0x0,
+	MLX5DR_ACTION_MDFY_HW_HDR_L3_IPV4	= 0x1,
+	MLX5DR_ACTION_MDFY_HW_HDR_L3_IPV6	= 0x2,
+};
+
+enum {
+	MLX5DR_ACTION_MDFY_HW_HDR_L4_NONE	= 0x0,
+	MLX5DR_ACTION_MDFY_HW_HDR_L4_TCP	= 0x1,
+	MLX5DR_ACTION_MDFY_HW_HDR_L4_UDP	= 0x2,
+};
+
+enum {
+	MLX5DR_STE_LU_TYPE_NOP				= 0x00,
+	MLX5DR_STE_LU_TYPE_SRC_GVMI_AND_QP		= 0x05,
+	MLX5DR_STE_LU_TYPE_ETHL2_TUNNELING_I		= 0x0a,
+	MLX5DR_STE_LU_TYPE_ETHL2_DST_O			= 0x06,
+	MLX5DR_STE_LU_TYPE_ETHL2_DST_I			= 0x07,
+	MLX5DR_STE_LU_TYPE_ETHL2_DST_D			= 0x1b,
+	MLX5DR_STE_LU_TYPE_ETHL2_SRC_O			= 0x08,
+	MLX5DR_STE_LU_TYPE_ETHL2_SRC_I			= 0x09,
+	MLX5DR_STE_LU_TYPE_ETHL2_SRC_D			= 0x1c,
+	MLX5DR_STE_LU_TYPE_ETHL2_SRC_DST_O		= 0x36,
+	MLX5DR_STE_LU_TYPE_ETHL2_SRC_DST_I		= 0x37,
+	MLX5DR_STE_LU_TYPE_ETHL2_SRC_DST_D		= 0x38,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV6_DST_O		= 0x0d,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV6_DST_I		= 0x0e,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV6_DST_D		= 0x1e,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV6_SRC_O		= 0x0f,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV6_SRC_I		= 0x10,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV6_SRC_D		= 0x1f,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV4_5_TUPLE_O		= 0x11,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV4_5_TUPLE_I		= 0x12,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV4_5_TUPLE_D		= 0x20,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV4_MISC_O		= 0x29,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV4_MISC_I		= 0x2a,
+	MLX5DR_STE_LU_TYPE_ETHL3_IPV4_MISC_D		= 0x2b,
+	MLX5DR_STE_LU_TYPE_ETHL4_O			= 0x13,
+	MLX5DR_STE_LU_TYPE_ETHL4_I			= 0x14,
+	MLX5DR_STE_LU_TYPE_ETHL4_D			= 0x21,
+	MLX5DR_STE_LU_TYPE_ETHL4_MISC_O			= 0x2c,
+	MLX5DR_STE_LU_TYPE_ETHL4_MISC_I			= 0x2d,
+	MLX5DR_STE_LU_TYPE_ETHL4_MISC_D			= 0x2e,
+	MLX5DR_STE_LU_TYPE_MPLS_FIRST_O			= 0x15,
+	MLX5DR_STE_LU_TYPE_MPLS_FIRST_I			= 0x24,
+	MLX5DR_STE_LU_TYPE_MPLS_FIRST_D			= 0x25,
+	MLX5DR_STE_LU_TYPE_GRE				= 0x16,
+	MLX5DR_STE_LU_TYPE_FLEX_PARSER_0		= 0x22,
+	MLX5DR_STE_LU_TYPE_FLEX_PARSER_1		= 0x23,
+	MLX5DR_STE_LU_TYPE_FLEX_PARSER_TNL_HEADER	= 0x19,
+	MLX5DR_STE_LU_TYPE_GENERAL_PURPOSE		= 0x18,
+	MLX5DR_STE_LU_TYPE_STEERING_REGISTERS_0		= 0x2f,
+	MLX5DR_STE_LU_TYPE_STEERING_REGISTERS_1		= 0x30,
+	MLX5DR_STE_LU_TYPE_DONT_CARE			= 0x0f,
+};
+
+enum mlx5dr_ste_entry_type {
+	MLX5DR_STE_TYPE_TX		= 1,
+	MLX5DR_STE_TYPE_RX		= 2,
+	MLX5DR_STE_TYPE_MODIFY_PKT	= 6,
+};
+
+struct mlx5_ifc_ste_general_bits {
+	u8         entry_type[0x4];
+	u8         reserved_at_4[0x4];
+	u8         entry_sub_type[0x8];
+	u8         byte_mask[0x10];
+
+	u8         next_table_base_63_48[0x10];
+	u8         next_lu_type[0x8];
+	u8         next_table_base_39_32_size[0x8];
+
+	u8         next_table_base_31_5_size[0x1b];
+	u8         linear_hash_enable[0x1];
+	u8         reserved_at_5c[0x2];
+	u8         next_table_rank[0x2];
+
+	u8         reserved_at_60[0xa0];
+	u8         tag_value[0x60];
+	u8         bit_mask[0x60];
+};
+
+struct mlx5_ifc_ste_sx_transmit_bits {
+	u8         entry_type[0x4];
+	u8         reserved_at_4[0x4];
+	u8         entry_sub_type[0x8];
+	u8         byte_mask[0x10];
+
+	u8         next_table_base_63_48[0x10];
+	u8         next_lu_type[0x8];
+	u8         next_table_base_39_32_size[0x8];
+
+	u8         next_table_base_31_5_size[0x1b];
+	u8         linear_hash_enable[0x1];
+	u8         reserved_at_5c[0x2];
+	u8         next_table_rank[0x2];
+
+	u8         sx_wire[0x1];
+	u8         sx_func_lb[0x1];
+	u8         sx_sniffer[0x1];
+	u8         sx_wire_enable[0x1];
+	u8         sx_func_lb_enable[0x1];
+	u8         sx_sniffer_enable[0x1];
+	u8         action_type[0x3];
+	u8         reserved_at_69[0x1];
+	u8         action_description[0x6];
+	u8         gvmi[0x10];
+
+	u8         encap_pointer_vlan_data[0x20];
+
+	u8         loopback_syndome_en[0x8];
+	u8         loopback_syndome[0x8];
+	u8         counter_trigger[0x10];
+
+	u8         miss_address_63_48[0x10];
+	u8         counter_trigger_23_16[0x8];
+	u8         miss_address_39_32[0x8];
+
+	u8         miss_address_31_6[0x1a];
+	u8         learning_point[0x1];
+	u8         go_back[0x1];
+	u8         match_polarity[0x1];
+	u8         mask_mode[0x1];
+	u8         miss_rank[0x2];
+};
+
+struct mlx5_ifc_ste_rx_steering_mult_bits {
+	u8         entry_type[0x4];
+	u8         reserved_at_4[0x4];
+	u8         entry_sub_type[0x8];
+	u8         byte_mask[0x10];
+
+	u8         next_table_base_63_48[0x10];
+	u8         next_lu_type[0x8];
+	u8         next_table_base_39_32_size[0x8];
+
+	u8         next_table_base_31_5_size[0x1b];
+	u8         linear_hash_enable[0x1];
+	u8         reserved_at_5c[0x2];
+	u8         next_table_rank[0x2];
+
+	u8         member_count[0x10];
+	u8         gvmi[0x10];
+
+	u8         qp_list_pointer[0x20];
+
+	u8         reserved_at_a0[0x1];
+	u8         tunneling_action[0x3];
+	u8         action_description[0x4];
+	u8         reserved_at_a8[0x8];
+	u8         counter_trigger_15_0[0x10];
+
+	u8         miss_address_63_48[0x10];
+	u8         counter_trigger_23_16[0x08];
+	u8         miss_address_39_32[0x8];
+
+	u8         miss_address_31_6[0x1a];
+	u8         learning_point[0x1];
+	u8         fail_on_error[0x1];
+	u8         match_polarity[0x1];
+	u8         mask_mode[0x1];
+	u8         miss_rank[0x2];
+};
+
+struct mlx5_ifc_ste_modify_packet_bits {
+	u8         entry_type[0x4];
+	u8         reserved_at_4[0x4];
+	u8         entry_sub_type[0x8];
+	u8         byte_mask[0x10];
+
+	u8         next_table_base_63_48[0x10];
+	u8         next_lu_type[0x8];
+	u8         next_table_base_39_32_size[0x8];
+
+	u8         next_table_base_31_5_size[0x1b];
+	u8         linear_hash_enable[0x1];
+	u8         reserved_at_5c[0x2];
+	u8         next_table_rank[0x2];
+
+	u8         number_of_re_write_actions[0x10];
+	u8         gvmi[0x10];
+
+	u8         header_re_write_actions_pointer[0x20];
+
+	u8         reserved_at_a0[0x1];
+	u8         tunneling_action[0x3];
+	u8         action_description[0x4];
+	u8         reserved_at_a8[0x8];
+	u8         counter_trigger_15_0[0x10];
+
+	u8         miss_address_63_48[0x10];
+	u8         counter_trigger_23_16[0x08];
+	u8         miss_address_39_32[0x8];
+
+	u8         miss_address_31_6[0x1a];
+	u8         learning_point[0x1];
+	u8         fail_on_error[0x1];
+	u8         match_polarity[0x1];
+	u8         mask_mode[0x1];
+	u8         miss_rank[0x2];
+};
+
+struct mlx5_ifc_ste_eth_l2_src_bits {
+	u8         smac_47_16[0x20];
+
+	u8         smac_15_0[0x10];
+	u8         l3_ethertype[0x10];
+
+	u8         qp_type[0x2];
+	u8         ethertype_filter[0x1];
+	u8         reserved_at_43[0x1];
+	u8         sx_sniffer[0x1];
+	u8         force_lb[0x1];
+	u8         functional_lb[0x1];
+	u8         port[0x1];
+	u8         reserved_at_48[0x4];
+	u8         first_priority[0x3];
+	u8         first_cfi[0x1];
+	u8         first_vlan_qualifier[0x2];
+	u8         reserved_at_52[0x2];
+	u8         first_vlan_id[0xc];
+
+	u8         ip_fragmented[0x1];
+	u8         tcp_syn[0x1];
+	u8         encp_type[0x2];
+	u8         l3_type[0x2];
+	u8         l4_type[0x2];
+	u8         reserved_at_68[0x4];
+	u8         second_priority[0x3];
+	u8         second_cfi[0x1];
+	u8         second_vlan_qualifier[0x2];
+	u8         reserved_at_72[0x2];
+	u8         second_vlan_id[0xc];
+};
+
+struct mlx5_ifc_ste_eth_l2_dst_bits {
+	u8         dmac_47_16[0x20];
+
+	u8         dmac_15_0[0x10];
+	u8         l3_ethertype[0x10];
+
+	u8         qp_type[0x2];
+	u8         ethertype_filter[0x1];
+	u8         reserved_at_43[0x1];
+	u8         sx_sniffer[0x1];
+	u8         force_lb[0x1];
+	u8         functional_lb[0x1];
+	u8         port[0x1];
+	u8         reserved_at_48[0x4];
+	u8         first_priority[0x3];
+	u8         first_cfi[0x1];
+	u8         first_vlan_qualifier[0x2];
+	u8         reserved_at_52[0x2];
+	u8         first_vlan_id[0xc];
+
+	u8         ip_fragmented[0x1];
+	u8         tcp_syn[0x1];
+	u8         encp_type[0x2];
+	u8         l3_type[0x2];
+	u8         l4_type[0x2];
+	u8         reserved_at_68[0x4];
+	u8         second_priority[0x3];
+	u8         second_cfi[0x1];
+	u8         second_vlan_qualifier[0x2];
+	u8         reserved_at_72[0x2];
+	u8         second_vlan_id[0xc];
+};
+
+struct mlx5_ifc_ste_eth_l2_src_dst_bits {
+	u8         dmac_47_16[0x20];
+
+	u8         dmac_15_0[0x10];
+	u8         smac_47_32[0x10];
+
+	u8         smac_31_0[0x20];
+
+	u8         sx_sniffer[0x1];
+	u8         force_lb[0x1];
+	u8         functional_lb[0x1];
+	u8         port[0x1];
+	u8         l3_type[0x2];
+	u8         reserved_at_66[0x6];
+	u8         first_priority[0x3];
+	u8         first_cfi[0x1];
+	u8         first_vlan_qualifier[0x2];
+	u8         reserved_at_72[0x2];
+	u8         first_vlan_id[0xc];
+};
+
+struct mlx5_ifc_ste_eth_l3_ipv4_5_tuple_bits {
+	u8         destination_address[0x20];
+
+	u8         source_address[0x20];
+
+	u8         source_port[0x10];
+	u8         destination_port[0x10];
+
+	u8         fragmented[0x1];
+	u8         first_fragment[0x1];
+	u8         reserved_at_62[0x2];
+	u8         reserved_at_64[0x1];
+	u8         ecn[0x2];
+	u8         tcp_ns[0x1];
+	u8         tcp_cwr[0x1];
+	u8         tcp_ece[0x1];
+	u8         tcp_urg[0x1];
+	u8         tcp_ack[0x1];
+	u8         tcp_psh[0x1];
+	u8         tcp_rst[0x1];
+	u8         tcp_syn[0x1];
+	u8         tcp_fin[0x1];
+	u8         dscp[0x6];
+	u8         reserved_at_76[0x2];
+	u8         protocol[0x8];
+};
+
+struct mlx5_ifc_ste_eth_l3_ipv6_dst_bits {
+	u8         dst_ip_127_96[0x20];
+
+	u8         dst_ip_95_64[0x20];
+
+	u8         dst_ip_63_32[0x20];
+
+	u8         dst_ip_31_0[0x20];
+};
+
+struct mlx5_ifc_ste_eth_l2_tnl_bits {
+	u8         dmac_47_16[0x20];
+
+	u8         dmac_15_0[0x10];
+	u8         l3_ethertype[0x10];
+
+	u8         l2_tunneling_network_id[0x20];
+
+	u8         ip_fragmented[0x1];
+	u8         tcp_syn[0x1];
+	u8         encp_type[0x2];
+	u8         l3_type[0x2];
+	u8         l4_type[0x2];
+	u8         first_priority[0x3];
+	u8         first_cfi[0x1];
+	u8         reserved_at_6c[0x3];
+	u8         gre_key_flag[0x1];
+	u8         first_vlan_qualifier[0x2];
+	u8         reserved_at_72[0x2];
+	u8         first_vlan_id[0xc];
+};
+
+struct mlx5_ifc_ste_eth_l3_ipv6_src_bits {
+	u8         src_ip_127_96[0x20];
+
+	u8         src_ip_95_64[0x20];
+
+	u8         src_ip_63_32[0x20];
+
+	u8         src_ip_31_0[0x20];
+};
+
+struct mlx5_ifc_ste_eth_l3_ipv4_misc_bits {
+	u8         version[0x4];
+	u8         ihl[0x4];
+	u8         reserved_at_8[0x8];
+	u8         total_length[0x10];
+
+	u8         identification[0x10];
+	u8         flags[0x3];
+	u8         fragment_offset[0xd];
+
+	u8         time_to_live[0x8];
+	u8         reserved_at_48[0x8];
+	u8         checksum[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_ste_eth_l4_bits {
+	u8         fragmented[0x1];
+	u8         first_fragment[0x1];
+	u8         reserved_at_2[0x6];
+	u8         protocol[0x8];
+	u8         dst_port[0x10];
+
+	u8         ipv6_version[0x4];
+	u8         reserved_at_24[0x1];
+	u8         ecn[0x2];
+	u8         tcp_ns[0x1];
+	u8         tcp_cwr[0x1];
+	u8         tcp_ece[0x1];
+	u8         tcp_urg[0x1];
+	u8         tcp_ack[0x1];
+	u8         tcp_psh[0x1];
+	u8         tcp_rst[0x1];
+	u8         tcp_syn[0x1];
+	u8         tcp_fin[0x1];
+	u8         src_port[0x10];
+
+	u8         ipv6_payload_length[0x10];
+	u8         ipv6_hop_limit[0x8];
+	u8         dscp[0x6];
+	u8         reserved_at_5e[0x2];
+
+	u8         tcp_data_offset[0x4];
+	u8         reserved_at_64[0x8];
+	u8         flow_label[0x14];
+};
+
+struct mlx5_ifc_ste_eth_l4_misc_bits {
+	u8         checksum[0x10];
+	u8         length[0x10];
+
+	u8         seq_num[0x20];
+
+	u8         ack_num[0x20];
+
+	u8         urgent_pointer[0x10];
+	u8         window_size[0x10];
+};
+
+struct mlx5_ifc_ste_mpls_bits {
+	u8         mpls0_label[0x14];
+	u8         mpls0_exp[0x3];
+	u8         mpls0_s_bos[0x1];
+	u8         mpls0_ttl[0x8];
+
+	u8         mpls1_label[0x20];
+
+	u8         mpls2_label[0x20];
+
+	u8         reserved_at_60[0x16];
+	u8         mpls4_s_bit[0x1];
+	u8         mpls4_qualifier[0x1];
+	u8         mpls3_s_bit[0x1];
+	u8         mpls3_qualifier[0x1];
+	u8         mpls2_s_bit[0x1];
+	u8         mpls2_qualifier[0x1];
+	u8         mpls1_s_bit[0x1];
+	u8         mpls1_qualifier[0x1];
+	u8         mpls0_s_bit[0x1];
+	u8         mpls0_qualifier[0x1];
+};
+
+struct mlx5_ifc_ste_register_0_bits {
+	u8         register_0_h[0x20];
+
+	u8         register_0_l[0x20];
+
+	u8         register_1_h[0x20];
+
+	u8         register_1_l[0x20];
+};
+
+struct mlx5_ifc_ste_register_1_bits {
+	u8         register_2_h[0x20];
+
+	u8         register_2_l[0x20];
+
+	u8         register_3_h[0x20];
+
+	u8         register_3_l[0x20];
+};
+
+struct mlx5_ifc_ste_gre_bits {
+	u8         gre_c_present[0x1];
+	u8         reserved_at_1[0x1];
+	u8         gre_k_present[0x1];
+	u8         gre_s_present[0x1];
+	u8         strict_src_route[0x1];
+	u8         recur[0x3];
+	u8         flags[0x5];
+	u8         version[0x3];
+	u8         gre_protocol[0x10];
+
+	u8         checksum[0x10];
+	u8         offset[0x10];
+
+	u8         gre_key_h[0x18];
+	u8         gre_key_l[0x8];
+
+	u8         seq_num[0x20];
+};
+
+struct mlx5_ifc_ste_flex_parser_0_bits {
+	u8         parser_3_label[0x14];
+	u8         parser_3_exp[0x3];
+	u8         parser_3_s_bos[0x1];
+	u8         parser_3_ttl[0x8];
+
+	u8         flex_parser_2[0x20];
+
+	u8         flex_parser_1[0x20];
+
+	u8         flex_parser_0[0x20];
+};
+
+struct mlx5_ifc_ste_flex_parser_1_bits {
+	u8         flex_parser_7[0x20];
+
+	u8         flex_parser_6[0x20];
+
+	u8         flex_parser_5[0x20];
+
+	u8         flex_parser_4[0x20];
+};
+
+struct mlx5_ifc_ste_flex_parser_tnl_bits {
+	u8         flex_parser_tunneling_header_63_32[0x20];
+
+	u8         flex_parser_tunneling_header_31_0[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_ste_general_purpose_bits {
+	u8         general_purpose_lookup_field[0x20];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x20];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_ste_src_gvmi_qp_bits {
+	u8         loopback_syndrome[0x8];
+	u8         reserved_at_8[0x8];
+	u8         source_gvmi[0x10];
+
+	u8         reserved_at_20[0x5];
+	u8         force_lb[0x1];
+	u8         functional_lb[0x1];
+	u8         source_is_requestor[0x1];
+	u8         source_qp[0x18];
+
+	u8         reserved_at_40[0x20];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_l2_hdr_bits {
+	u8         dmac_47_16[0x20];
+
+	u8         dmac_15_0[0x10];
+	u8         smac_47_32[0x10];
+
+	u8         smac_31_0[0x20];
+
+	u8         ethertype[0x10];
+	u8         vlan_type[0x10];
+
+	u8         vlan[0x10];
+	u8         reserved_at_90[0x10];
+};
+
+/* Both HW set and HW add share the same HW format with different opcodes */
+struct mlx5_ifc_dr_action_hw_set_bits {
+	u8         opcode[0x8];
+	u8         destination_field_code[0x8];
+	u8         reserved_at_10[0x2];
+	u8         destination_left_shifter[0x6];
+	u8         reserved_at_18[0x3];
+	u8         destination_length[0x5];
+
+	u8         inline_data[0x20];
+};
+
+#endif /* MLX5_IFC_DR_H */
-- 
2.21.0
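
A note on the mlx5_ifc_dr.h layouts in the patch above: each `u8 field[N]` member stands for an N-bit hardware field, so byte offsets inside these structs correspond to bit offsets in the hardware format, and the `reserved_at_<hex>` names are meant to record that offset. The following is a minimal userspace sketch of that convention, not driver code. The `SKETCH_BIT_OFF` and `SKETCH_BIT_SZ` macros and the `sketch_reserved_names_ok()` helper are hypothetical names introduced for illustration; only the struct layout is copied from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t u8;

/* Layout copied from the patch: one u8 array element per hardware bit. */
struct mlx5_ifc_dr_action_hw_set_bits {
	u8         opcode[0x8];
	u8         destination_field_code[0x8];
	u8         reserved_at_10[0x2];
	u8         destination_left_shifter[0x6];
	u8         reserved_at_18[0x3];
	u8         destination_length[0x5];

	u8         inline_data[0x20];
};

/* Hypothetical helpers: byte offset/size in the struct == bit offset/size in HW. */
#define SKETCH_BIT_OFF(typ, fld) offsetof(struct typ, fld)
#define SKETCH_BIT_SZ(typ, fld)  sizeof(((struct typ *)0)->fld)

/* The action occupies two 32-bit words, i.e. 0x40 bits in total. */
static_assert(sizeof(struct mlx5_ifc_dr_action_hw_set_bits) == 0x40,
	      "hw_set action should span 64 bits");

/* Returns 1 when each reserved_at_<hex> name matches its actual bit offset. */
static int sketch_reserved_names_ok(void)
{
	return SKETCH_BIT_OFF(mlx5_ifc_dr_action_hw_set_bits,
			      reserved_at_10) == 0x10 &&
	       SKETCH_BIT_OFF(mlx5_ifc_dr_action_hw_set_bits,
			      reserved_at_18) == 0x18;
}
```

A check of this kind can be applied to any of the `mlx5_ifc_ste_*_bits` structs to confirm that a reserved field's name agrees with the widths of the fields before it.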



* [net-next 04/18] net/mlx5: DR, ICM pool memory allocator
From: Saeed Mahameed @ 2019-09-02  7:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Alex Vesker, Erez Shitrit, Mark Bloch,
	Saeed Mahameed
In-Reply-To: <20190902072213.7683-1-saeedm@mellanox.com>

From: Alex Vesker <valex@mellanox.com>

ICM device memory is used for writing steering rules (STEs) to the NIC.
An ICM memory pool allocator is implemented to manage the required
memory. The pool consists of buckets, one bucket per chunk size.
Once a bucket is empty, a row of memory is cut from the most recently
allocated MR; if that MR does not have enough free space, a new MR is
allocated.
The HW design requires each chunk's memory address to be aligned to the
chunk size, which is why the MR is managed in rows whose size ensures
memory alignment.
The current design is greedy in memory but provides quick allocation
times in steady state.
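
The row-cutting and alignment rule described above can be illustrated with a small userspace sketch. It is a simplified model under stated assumptions, not the code in this patch: `struct sketch_mr` and the `sketch_*` functions are hypothetical names, the row size is assumed to equal the alignment base, and the MR is assumed to be twice the row size so that one aligned row always fits regardless of where the MR starts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the MR bookkeeping. */
struct sketch_mr {
	uint64_t start_addr;	/* device address where the MR begins */
	size_t length;		/* total MR size in bytes */
	size_t used_length;	/* bytes already handed out or skipped */
};

/* Bytes to skip so that start_addr + pad is a multiple of align_base. */
static size_t sketch_align_pad(uint64_t start_addr, size_t align_base)
{
	size_t diff = start_addr % align_base;

	return diff ? align_base - diff : 0;
}

/* Size the MR at twice the row size and skip the unaligned head. */
static void sketch_mr_init(struct sketch_mr *mr, uint64_t start_addr,
			   size_t row_size)
{
	mr->start_addr = start_addr;
	mr->length = row_size * 2;
	mr->used_length = sketch_align_pad(start_addr, row_size);
}

/* Cut one row: returns 0 and its address, or -1 if the MR is exhausted. */
static int sketch_mr_cut_row(struct sketch_mr *mr, size_t row_size,
			     uint64_t *row_addr)
{
	if (mr->length - mr->used_length < row_size)
		return -1;	/* caller would allocate a new MR here */

	*row_addr = mr->start_addr + mr->used_length;
	mr->used_length += row_size;
	assert(*row_addr % row_size == 0);
	return 0;
}
```

In this model an MR that starts at an unaligned address yields a single aligned row and an aligned MR yields two, which is the memory-for-speed trade-off the last paragraph refers to.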

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/steering/dr_icm_pool.c | 570 ++++++++++++++++++
 1 file changed, 570 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c
new file mode 100644
index 000000000000..e76f61e7555e
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c
@@ -0,0 +1,570 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "dr_types.h"
+
+#define DR_ICM_MODIFY_HDR_ALIGN_BASE 64
+#define DR_ICM_SYNC_THRESHOLD (64 * 1024 * 1024)
+
+struct mlx5dr_icm_pool;
+
+struct mlx5dr_icm_bucket {
+	struct mlx5dr_icm_pool *pool;
+
+	/* Chunks that aren't visible to HW, neither directly nor in cache */
+	struct list_head free_list;
+	unsigned int free_list_count;
+
+	/* Used chunks, HW may be accessing this memory */
+	struct list_head used_list;
+	unsigned int used_list_count;
+
+	/* HW may be accessing this memory but at some future,
+	 * undetermined time, it might cease to do so. Before deciding to call
+	 * sync_ste, this list is moved to sync_list
+	 */
+	struct list_head hot_list;
+	unsigned int hot_list_count;
+
+	/* Pending sync list, entries from the hot list are moved to this list.
+	 * sync_ste is executed and then sync_list is concatenated to the free list
+	 */
+	struct list_head sync_list;
+	unsigned int sync_list_count;
+
+	u32 total_chunks;
+	u32 num_of_entries;
+	u32 entry_size;
+	/* protect the ICM bucket */
+	struct mutex mutex;
+};
+
+struct mlx5dr_icm_pool {
+	struct mlx5dr_icm_bucket *buckets;
+	enum mlx5dr_icm_type icm_type;
+	enum mlx5dr_icm_chunk_size max_log_chunk_sz;
+	enum mlx5dr_icm_chunk_size num_of_buckets;
+	struct list_head icm_mr_list;
+	/* protect the ICM MR list */
+	struct mutex mr_mutex;
+	struct mlx5dr_domain *dmn;
+};
+
+struct mlx5dr_icm_dm {
+	u32 obj_id;
+	enum mlx5_sw_icm_type type;
+	u64 addr;
+	size_t length;
+};
+
+struct mlx5dr_icm_mr {
+	struct mlx5dr_icm_pool *pool;
+	struct mlx5_core_mkey mkey;
+	struct mlx5dr_icm_dm dm;
+	size_t used_length;
+	size_t length;
+	u64 icm_start_addr;
+	struct list_head mr_list;
+};
+
+static int dr_icm_create_dm_mkey(struct mlx5_core_dev *mdev,
+				 u32 pd, u64 length, u64 start_addr, int mode,
+				 struct mlx5_core_mkey *mkey)
+{
+	u32 inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
+	u32 in[MLX5_ST_SZ_DW(create_mkey_in)] = {};
+	void *mkc;
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+
+	MLX5_SET(mkc, mkc, access_mode_1_0, mode);
+	MLX5_SET(mkc, mkc, access_mode_4_2, (mode >> 2) & 0x7);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, lr, 1);
+	if (mode == MLX5_MKC_ACCESS_MODE_SW_ICM) {
+		MLX5_SET(mkc, mkc, rw, 1);
+		MLX5_SET(mkc, mkc, rr, 1);
+	}
+
+	MLX5_SET64(mkc, mkc, len, length);
+	MLX5_SET(mkc, mkc, pd, pd);
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET64(mkc, mkc, start_addr, start_addr);
+
+	return mlx5_core_create_mkey(mdev, mkey, in, inlen);
+}
+
+static struct mlx5dr_icm_mr *
+dr_icm_pool_mr_create(struct mlx5dr_icm_pool *pool,
+		      enum mlx5_sw_icm_type type,
+		      size_t align_base)
+{
+	struct mlx5_core_dev *mdev = pool->dmn->mdev;
+	struct mlx5dr_icm_mr *icm_mr;
+	size_t align_diff;
+	int err;
+
+	icm_mr = kvzalloc(sizeof(*icm_mr), GFP_KERNEL);
+	if (!icm_mr)
+		return NULL;
+
+	icm_mr->pool = pool;
+	INIT_LIST_HEAD(&icm_mr->mr_list);
+
+	icm_mr->dm.type = type;
+
+	/* 2^log_biggest_table * entry-size * double-for-alignment */
+	icm_mr->dm.length = mlx5dr_icm_pool_chunk_size_to_byte(pool->max_log_chunk_sz,
+							       pool->icm_type) * 2;
+
+	err = mlx5_dm_sw_icm_alloc(mdev, icm_mr->dm.type, icm_mr->dm.length, 0,
+				   &icm_mr->dm.addr, &icm_mr->dm.obj_id);
+	if (err) {
+		mlx5dr_err(pool->dmn, "Failed to allocate SW ICM memory, err (%d)\n", err);
+		goto free_icm_mr;
+	}
+
+	/* Register device memory */
+	err = dr_icm_create_dm_mkey(mdev, pool->dmn->pdn,
+				    icm_mr->dm.length,
+				    icm_mr->dm.addr,
+				    MLX5_MKC_ACCESS_MODE_SW_ICM,
+				    &icm_mr->mkey);
+	if (err) {
+		mlx5dr_err(pool->dmn, "Failed to create SW ICM MKEY, err (%d)\n", err);
+		goto free_dm;
+	}
+
+	icm_mr->icm_start_addr = icm_mr->dm.addr;
+
+	align_diff = icm_mr->icm_start_addr % align_base;
+	if (align_diff)
+		icm_mr->used_length = align_base - align_diff;
+
+	list_add_tail(&icm_mr->mr_list, &pool->icm_mr_list);
+
+	return icm_mr;
+
+free_dm:
+	mlx5_dm_sw_icm_dealloc(mdev, icm_mr->dm.type, icm_mr->dm.length, 0,
+			       icm_mr->dm.addr, icm_mr->dm.obj_id);
+free_icm_mr:
+	kvfree(icm_mr);
+	return NULL;
+}
+
+static void dr_icm_pool_mr_destroy(struct mlx5dr_icm_mr *icm_mr)
+{
+	struct mlx5_core_dev *mdev = icm_mr->pool->dmn->mdev;
+	struct mlx5dr_icm_dm *dm = &icm_mr->dm;
+
+	list_del(&icm_mr->mr_list);
+	mlx5_core_destroy_mkey(mdev, &icm_mr->mkey);
+	mlx5_dm_sw_icm_dealloc(mdev, dm->type, dm->length, 0,
+			       dm->addr, dm->obj_id);
+	kvfree(icm_mr);
+}
+
+static int dr_icm_chunk_ste_init(struct mlx5dr_icm_chunk *chunk)
+{
+	struct mlx5dr_icm_bucket *bucket = chunk->bucket;
+
+	chunk->ste_arr = kvzalloc(bucket->num_of_entries *
+				  sizeof(chunk->ste_arr[0]), GFP_KERNEL);
+	if (!chunk->ste_arr)
+		return -ENOMEM;
+
+	chunk->hw_ste_arr = kvzalloc(bucket->num_of_entries *
+				     DR_STE_SIZE_REDUCED, GFP_KERNEL);
+	if (!chunk->hw_ste_arr)
+		goto out_free_ste_arr;
+
+	chunk->miss_list = kvmalloc(bucket->num_of_entries *
+				    sizeof(chunk->miss_list[0]), GFP_KERNEL);
+	if (!chunk->miss_list)
+		goto out_free_hw_ste_arr;
+
+	return 0;
+
+out_free_hw_ste_arr:
+	kvfree(chunk->hw_ste_arr);
+out_free_ste_arr:
+	kvfree(chunk->ste_arr);
+	return -ENOMEM;
+}
+
+static int dr_icm_chunks_create(struct mlx5dr_icm_bucket *bucket)
+{
+	size_t mr_free_size, mr_req_size, mr_row_size;
+	struct mlx5dr_icm_pool *pool = bucket->pool;
+	struct mlx5dr_icm_mr *icm_mr = NULL;
+	struct mlx5dr_icm_chunk *chunk;
+	enum mlx5_sw_icm_type dm_type;
+	size_t align_base;
+	int i, err = 0;
+
+	mr_req_size = bucket->num_of_entries * bucket->entry_size;
+	mr_row_size = mlx5dr_icm_pool_chunk_size_to_byte(pool->max_log_chunk_sz,
+							 pool->icm_type);
+
+	if (pool->icm_type == DR_ICM_TYPE_STE) {
+		dm_type = MLX5_SW_ICM_TYPE_STEERING;
+		/* Align base is the biggest chunk size / row size */
+		align_base = mr_row_size;
+	} else {
+		dm_type = MLX5_SW_ICM_TYPE_HEADER_MODIFY;
+		/* Align base is 64B */
+		align_base = DR_ICM_MODIFY_HDR_ALIGN_BASE;
+	}
+
+	mutex_lock(&pool->mr_mutex);
+	if (!list_empty(&pool->icm_mr_list)) {
+		icm_mr = list_last_entry(&pool->icm_mr_list,
+					 struct mlx5dr_icm_mr, mr_list);
+
+		if (icm_mr)
+			mr_free_size = icm_mr->dm.length - icm_mr->used_length;
+	}
+
+	if (!icm_mr || mr_free_size < mr_row_size) {
+		icm_mr = dr_icm_pool_mr_create(pool, dm_type, align_base);
+		if (!icm_mr) {
+			err = -ENOMEM;
+			goto out_err;
+		}
+	}
+
+	/* Create memory aligned chunks */
+	for (i = 0; i < mr_row_size / mr_req_size; i++) {
+		chunk = kvzalloc(sizeof(*chunk), GFP_KERNEL);
+		if (!chunk) {
+			err = -ENOMEM;
+			goto out_err;
+		}
+
+		chunk->bucket = bucket;
+		chunk->rkey = icm_mr->mkey.key;
+		/* mr start addr is zero based */
+		chunk->mr_addr = icm_mr->used_length;
+		chunk->icm_addr = (uintptr_t)icm_mr->icm_start_addr + icm_mr->used_length;
+		icm_mr->used_length += mr_req_size;
+		chunk->num_of_entries = bucket->num_of_entries;
+		chunk->byte_size = chunk->num_of_entries * bucket->entry_size;
+
+		if (pool->icm_type == DR_ICM_TYPE_STE) {
+			err = dr_icm_chunk_ste_init(chunk);
+			if (err)
+				goto out_free_chunk;
+		}
+
+		INIT_LIST_HEAD(&chunk->chunk_list);
+		list_add(&chunk->chunk_list, &bucket->free_list);
+		bucket->free_list_count++;
+		bucket->total_chunks++;
+	}
+	mutex_unlock(&pool->mr_mutex);
+	return 0;
+
+out_free_chunk:
+	kvfree(chunk);
+out_err:
+	mutex_unlock(&pool->mr_mutex);
+	return err;
+}
+
+static void dr_icm_chunk_ste_cleanup(struct mlx5dr_icm_chunk *chunk)
+{
+	kvfree(chunk->miss_list);
+	kvfree(chunk->hw_ste_arr);
+	kvfree(chunk->ste_arr);
+}
+
+static void dr_icm_chunk_destroy(struct mlx5dr_icm_chunk *chunk)
+{
+	struct mlx5dr_icm_bucket *bucket = chunk->bucket;
+
+	list_del(&chunk->chunk_list);
+	bucket->total_chunks--;
+
+	if (bucket->pool->icm_type == DR_ICM_TYPE_STE)
+		dr_icm_chunk_ste_cleanup(chunk);
+
+	kvfree(chunk);
+}
+
+static void dr_icm_bucket_init(struct mlx5dr_icm_pool *pool,
+			       struct mlx5dr_icm_bucket *bucket,
+			       enum mlx5dr_icm_chunk_size chunk_size)
+{
+	if (pool->icm_type == DR_ICM_TYPE_STE)
+		bucket->entry_size = DR_STE_SIZE;
+	else
+		bucket->entry_size = DR_MODIFY_ACTION_SIZE;
+
+	bucket->num_of_entries = mlx5dr_icm_pool_chunk_size_to_entries(chunk_size);
+	bucket->pool = pool;
+	mutex_init(&bucket->mutex);
+	INIT_LIST_HEAD(&bucket->free_list);
+	INIT_LIST_HEAD(&bucket->used_list);
+	INIT_LIST_HEAD(&bucket->hot_list);
+	INIT_LIST_HEAD(&bucket->sync_list);
+}
+
+static void dr_icm_bucket_cleanup(struct mlx5dr_icm_bucket *bucket)
+{
+	struct mlx5dr_icm_chunk *chunk, *next;
+
+	mutex_destroy(&bucket->mutex);
+	list_splice_tail_init(&bucket->sync_list, &bucket->free_list);
+	list_splice_tail_init(&bucket->hot_list, &bucket->free_list);
+
+	list_for_each_entry_safe(chunk, next, &bucket->free_list, chunk_list)
+		dr_icm_chunk_destroy(chunk);
+
+	WARN_ON(bucket->total_chunks != 0);
+
+	/* Cleanup of unreturned chunks */
+	list_for_each_entry_safe(chunk, next, &bucket->used_list, chunk_list)
+		dr_icm_chunk_destroy(chunk);
+}
+
+static u64 dr_icm_hot_mem_size(struct mlx5dr_icm_pool *pool)
+{
+	u64 hot_size = 0;
+	int chunk_order;
+
+	for (chunk_order = 0; chunk_order < pool->num_of_buckets; chunk_order++)
+		hot_size += pool->buckets[chunk_order].hot_list_count *
+			    mlx5dr_icm_pool_chunk_size_to_byte(chunk_order, pool->icm_type);
+
+	return hot_size;
+}
+
+static bool dr_icm_reuse_hot_entries(struct mlx5dr_icm_pool *pool,
+				     struct mlx5dr_icm_bucket *bucket)
+{
+	u64 bytes_for_sync;
+
+	bytes_for_sync = dr_icm_hot_mem_size(pool);
+	if (bytes_for_sync < DR_ICM_SYNC_THRESHOLD || !bucket->hot_list_count)
+		return false;
+
+	return true;
+}
+
+static void dr_icm_chill_bucket_start(struct mlx5dr_icm_bucket *bucket)
+{
+	list_splice_tail_init(&bucket->hot_list, &bucket->sync_list);
+	bucket->sync_list_count += bucket->hot_list_count;
+	bucket->hot_list_count = 0;
+}
+
+static void dr_icm_chill_bucket_end(struct mlx5dr_icm_bucket *bucket)
+{
+	list_splice_tail_init(&bucket->sync_list, &bucket->free_list);
+	bucket->free_list_count += bucket->sync_list_count;
+	bucket->sync_list_count = 0;
+}
+
+static void dr_icm_chill_bucket_abort(struct mlx5dr_icm_bucket *bucket)
+{
+	list_splice_tail_init(&bucket->sync_list, &bucket->hot_list);
+	bucket->hot_list_count += bucket->sync_list_count;
+	bucket->sync_list_count = 0;
+}
+
+static void dr_icm_chill_buckets_start(struct mlx5dr_icm_pool *pool,
+				       struct mlx5dr_icm_bucket *cb,
+				       bool buckets[DR_CHUNK_SIZE_MAX])
+{
+	struct mlx5dr_icm_bucket *bucket;
+	int i;
+
+	for (i = 0; i < pool->num_of_buckets; i++) {
+		bucket = &pool->buckets[i];
+		if (bucket == cb) {
+			dr_icm_chill_bucket_start(bucket);
+			continue;
+		}
+
+		/* The mutex is released at the end of this process, in
+		 * dr_icm_chill_buckets_end() (or _abort() on failure), after
+		 * the steering sync has run.
+		 */
+		if (mutex_trylock(&bucket->mutex)) {
+			dr_icm_chill_bucket_start(bucket);
+			buckets[i] = true;
+		}
+	}
+}
+
+static void dr_icm_chill_buckets_end(struct mlx5dr_icm_pool *pool,
+				     struct mlx5dr_icm_bucket *cb,
+				     bool buckets[DR_CHUNK_SIZE_MAX])
+{
+	struct mlx5dr_icm_bucket *bucket;
+	int i;
+
+	for (i = 0; i < pool->num_of_buckets; i++) {
+		bucket = &pool->buckets[i];
+		if (bucket == cb) {
+			dr_icm_chill_bucket_end(bucket);
+			continue;
+		}
+
+		if (!buckets[i])
+			continue;
+
+		dr_icm_chill_bucket_end(bucket);
+		mutex_unlock(&bucket->mutex);
+	}
+}
+
+static void dr_icm_chill_buckets_abort(struct mlx5dr_icm_pool *pool,
+				       struct mlx5dr_icm_bucket *cb,
+				       bool buckets[DR_CHUNK_SIZE_MAX])
+{
+	struct mlx5dr_icm_bucket *bucket;
+	int i;
+
+	for (i = 0; i < pool->num_of_buckets; i++) {
+		bucket = &pool->buckets[i];
+		if (bucket == cb) {
+			dr_icm_chill_bucket_abort(bucket);
+			continue;
+		}
+
+		if (!buckets[i])
+			continue;
+
+		dr_icm_chill_bucket_abort(bucket);
+		mutex_unlock(&bucket->mutex);
+	}
+}
+
+/* Allocate an ICM chunk. Each chunk holds a piece of ICM memory and
+ * also the memory used for HW STE management, kept as an optimization.
+ */
+struct mlx5dr_icm_chunk *
+mlx5dr_icm_alloc_chunk(struct mlx5dr_icm_pool *pool,
+		       enum mlx5dr_icm_chunk_size chunk_size)
+{
+	struct mlx5dr_icm_chunk *chunk = NULL; /* Fix compilation warning */
+	bool buckets[DR_CHUNK_SIZE_MAX] = {};
+	struct mlx5dr_icm_bucket *bucket;
+	int err;
+
+	if (chunk_size > pool->max_log_chunk_sz)
+		return NULL;
+
+	bucket = &pool->buckets[chunk_size];
+
+	mutex_lock(&bucket->mutex);
+
+	/* Take chunk from pool if available, otherwise allocate new chunks */
+	if (list_empty(&bucket->free_list)) {
+		if (dr_icm_reuse_hot_entries(pool, bucket)) {
+			dr_icm_chill_buckets_start(pool, bucket, buckets);
+			err = mlx5dr_cmd_sync_steering(pool->dmn->mdev);
+			if (err) {
+				dr_icm_chill_buckets_abort(pool, bucket, buckets);
+				mlx5dr_dbg(pool->dmn, "Sync_steering failed\n");
+				chunk = NULL;
+				goto out;
+			}
+			dr_icm_chill_buckets_end(pool, bucket, buckets);
+		} else {
+			dr_icm_chunks_create(bucket);
+		}
+	}
+
+	if (!list_empty(&bucket->free_list)) {
+		chunk = list_last_entry(&bucket->free_list,
+					struct mlx5dr_icm_chunk,
+					chunk_list);
+		if (chunk) {
+			list_del_init(&chunk->chunk_list);
+			list_add_tail(&chunk->chunk_list, &bucket->used_list);
+			bucket->free_list_count--;
+			bucket->used_list_count++;
+		}
+	}
+out:
+	mutex_unlock(&bucket->mutex);
+	return chunk;
+}
+
+void mlx5dr_icm_free_chunk(struct mlx5dr_icm_chunk *chunk)
+{
+	struct mlx5dr_icm_bucket *bucket = chunk->bucket;
+
+	if (bucket->pool->icm_type == DR_ICM_TYPE_STE) {
+		memset(chunk->ste_arr, 0,
+		       bucket->num_of_entries * sizeof(chunk->ste_arr[0]));
+		memset(chunk->hw_ste_arr, 0,
+		       bucket->num_of_entries * DR_STE_SIZE_REDUCED);
+	}
+
+	mutex_lock(&bucket->mutex);
+	list_del_init(&chunk->chunk_list);
+	list_add_tail(&chunk->chunk_list, &bucket->hot_list);
+	bucket->hot_list_count++;
+	bucket->used_list_count--;
+	mutex_unlock(&bucket->mutex);
+}
+
+struct mlx5dr_icm_pool *mlx5dr_icm_pool_create(struct mlx5dr_domain *dmn,
+					       enum mlx5dr_icm_type icm_type)
+{
+	enum mlx5dr_icm_chunk_size max_log_chunk_sz;
+	struct mlx5dr_icm_pool *pool;
+	int i;
+
+	if (icm_type == DR_ICM_TYPE_STE)
+		max_log_chunk_sz = dmn->info.max_log_sw_icm_sz;
+	else
+		max_log_chunk_sz = dmn->info.max_log_action_icm_sz;
+
+	pool = kvzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	pool->buckets = kcalloc(max_log_chunk_sz + 1,
+				sizeof(pool->buckets[0]),
+				GFP_KERNEL);
+	if (!pool->buckets)
+		goto free_pool;
+
+	pool->dmn = dmn;
+	pool->icm_type = icm_type;
+	pool->max_log_chunk_sz = max_log_chunk_sz;
+	pool->num_of_buckets = max_log_chunk_sz + 1;
+	INIT_LIST_HEAD(&pool->icm_mr_list);
+
+	for (i = 0; i < pool->num_of_buckets; i++)
+		dr_icm_bucket_init(pool, &pool->buckets[i], i);
+
+	mutex_init(&pool->mr_mutex);
+
+	return pool;
+
+free_pool:
+	kvfree(pool);
+	return NULL;
+}
+
+void mlx5dr_icm_pool_destroy(struct mlx5dr_icm_pool *pool)
+{
+	struct mlx5dr_icm_mr *icm_mr, *next;
+	int i;
+
+	mutex_destroy(&pool->mr_mutex);
+
+	list_for_each_entry_safe(icm_mr, next, &pool->icm_mr_list, mr_list)
+		dr_icm_pool_mr_destroy(icm_mr);
+
+	for (i = 0; i < pool->num_of_buckets; i++)
+		dr_icm_bucket_cleanup(&pool->buckets[i]);
+
+	kfree(pool->buckets);
+	kvfree(pool);
+}
-- 
2.21.0

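A note on the alignment step in dr_icm_pool_mr_create() above: the pool
allocates twice the largest chunk size and then skips
align_base - (addr % align_base) bytes so that the first chunk starts on
an aligned address. The user-space sketch below only illustrates that
arithmetic. align_skip() and usable_after_skip() are hypothetical helper
names, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the patch): number of bytes to skip
 * so that start_addr + skip is a multiple of align_base.
 */
static size_t align_skip(uint64_t start_addr, size_t align_base)
{
	size_t diff;

	assert(align_base != 0);
	diff = (size_t)(start_addr % align_base);
	return diff ? align_base - diff : 0;
}

/* Bytes still usable after the skip when the region is twice the row
 * size, mirroring the "double-for-alignment" comment in the patch.
 */
static size_t usable_after_skip(uint64_t start_addr, size_t row_size)
{
	return 2 * row_size - align_skip(start_addr, row_size);
}
```

Because the skip is always smaller than align_base, a region of twice
the row size always leaves at least one full aligned row.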


* [net-next 01/18] net/mlx5: Add flow steering actions to fs_cmd shim layer
From: Saeed Mahameed @ 2019-09-02  7:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Alex Vesker, Erez Shitrit, Maor Gottlieb,
	Mark Bloch, Saeed Mahameed
In-Reply-To: <20190902072213.7683-1-saeedm@mellanox.com>

From: Maor Gottlieb <maorg@mellanox.com>

Add flow steering actions: modify header and packet reformat
to the fs_cmd shim layer. This allows each namespace to define
possibly different functionality for alloc/dealloc action commands.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
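Note for readers: the dispatch described in the commit message can be
sketched in user space as below. This is a minimal illustration, not the
mlx5 code. Every name here (struct ns_cmds, reformat_alloc(), err_ptr()
and friends) is a hypothetical stand-in, and the assumption that the
core wrapper allocates the handle before calling the per-namespace
command is inferred from the callers in the diff, which now receive a
pointer and test it with IS_ERR().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095
static inline void *err_ptr(long err) { return (void *)err; }
static inline long ptr_err(const void *p) { return (long)p; }
static inline int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical handle: the backend fills in the id. */
struct pkt_reformat { uint32_t id; };

struct root_ns;

/* Per-namespace command table: each namespace may plug in its own ops. */
struct ns_cmds {
	int (*reformat_alloc)(struct root_ns *ns, struct pkt_reformat *pr);
	void (*reformat_dealloc)(struct root_ns *ns, struct pkt_reformat *pr);
};

struct root_ns {
	const struct ns_cmds *cmds;
	uint32_t next_id;	/* fake id source for the example backend */
};

/* Example backend that hands out increasing ids. */
static int fw_alloc(struct root_ns *ns, struct pkt_reformat *pr)
{
	pr->id = ns->next_id++;
	return 0;
}

static void fw_dealloc(struct root_ns *ns, struct pkt_reformat *pr)
{
	(void)ns;
	(void)pr;
}

/* Example backend that always refuses, to exercise the error path. */
static int refuse_alloc(struct root_ns *ns, struct pkt_reformat *pr)
{
	(void)ns;
	(void)pr;
	return -EOPNOTSUPP;
}

static const struct ns_cmds fw_cmds = { fw_alloc, fw_dealloc };
static const struct ns_cmds refuse_cmds = { refuse_alloc, fw_dealloc };

/* Core wrapper: allocate the handle, dispatch, return handle or error. */
static struct pkt_reformat *reformat_alloc(struct root_ns *ns)
{
	struct pkt_reformat *pr = calloc(1, sizeof(*pr));
	int err;

	if (!pr)
		return err_ptr(-ENOMEM);

	err = ns->cmds->reformat_alloc(ns, pr);
	if (err) {
		free(pr);
		return err_ptr(err);
	}
	return pr;
}

static void reformat_dealloc(struct root_ns *ns, struct pkt_reformat *pr)
{
	ns->cmds->reformat_dealloc(ns, pr);
	free(pr);
}
```

With this shape a caller stores the returned pointer and converts it to
an errno with ptr_err() only on failure, which is the pattern the
converted call sites in en_tc.c and tc_tun.c follow.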
 drivers/infiniband/hw/mlx5/flow.c             |  18 +--
 drivers/infiniband/hw/mlx5/main.c             |   7 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |   5 +-
 .../ethernet/mellanox/mlx5/core/en/tc_tun.c   |  27 ++---
 .../net/ethernet/mellanox/mlx5/core/en_rep.h  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  46 ++++----
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   6 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     |  26 +++--
 .../net/ethernet/mellanox/mlx5/core/fs_cmd.c  |  92 ++++++++++-----
 .../net/ethernet/mellanox/mlx5/core/fs_cmd.h  |  18 +++
 .../net/ethernet/mellanox/mlx5/core/fs_core.c | 105 +++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/fs_core.h |  11 ++
 include/linux/mlx5/fs.h                       |  33 +++---
 13 files changed, 286 insertions(+), 110 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index b8841355fcd5..324acc7b5b9e 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -322,11 +322,11 @@ void mlx5_ib_destroy_flow_action_raw(struct mlx5_ib_flow_action *maction)
 	switch (maction->flow_action_raw.sub_type) {
 	case MLX5_IB_FLOW_ACTION_MODIFY_HEADER:
 		mlx5_modify_header_dealloc(maction->flow_action_raw.dev->mdev,
-					   maction->flow_action_raw.action_id);
+					   maction->flow_action_raw.modify_hdr);
 		break;
 	case MLX5_IB_FLOW_ACTION_PACKET_REFORMAT:
 		mlx5_packet_reformat_dealloc(maction->flow_action_raw.dev->mdev,
-			maction->flow_action_raw.action_id);
+					     maction->flow_action_raw.pkt_reformat);
 		break;
 	case MLX5_IB_FLOW_ACTION_DECAP:
 		break;
@@ -352,10 +352,11 @@ mlx5_ib_create_modify_header(struct mlx5_ib_dev *dev,
 	if (!maction)
 		return ERR_PTR(-ENOMEM);
 
-	ret = mlx5_modify_header_alloc(dev->mdev, namespace, num_actions, in,
-				       &maction->flow_action_raw.action_id);
+	maction->flow_action_raw.modify_hdr =
+		mlx5_modify_header_alloc(dev->mdev, namespace, num_actions, in);
 
-	if (ret) {
+	if (IS_ERR(maction->flow_action_raw.modify_hdr)) {
+		ret = PTR_ERR(maction->flow_action_raw.modify_hdr);
 		kfree(maction);
 		return ERR_PTR(ret);
 	}
@@ -479,10 +479,10 @@ static int mlx5_ib_flow_action_create_packet_reformat_ctx(
 	if (ret)
 		return ret;
 
-	ret = mlx5_packet_reformat_alloc(dev->mdev, prm_prt, len,
-					 in, namespace,
-					 &maction->flow_action_raw.action_id);
-	if (ret)
+	maction->flow_action_raw.pkt_reformat =
+		mlx5_packet_reformat_alloc(dev->mdev, prm_prt, len,
+					   in, namespace);
+	if (IS_ERR(maction->flow_action_raw.pkt_reformat))
-		return ret;
+		return PTR_ERR(maction->flow_action_raw.pkt_reformat);
 
 	maction->flow_action_raw.sub_type =
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 016373d1d27e..4e9f1507ffd9 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2658,7 +2658,8 @@ int parse_flow_flow_action(struct mlx5_ib_flow_action *maction,
 			if (action->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
 				return -EINVAL;
 			action->action |= MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
-			action->modify_id = maction->flow_action_raw.action_id;
+			action->modify_hdr =
+				maction->flow_action_raw.modify_hdr;
 			return 0;
 		}
 		if (maction->flow_action_raw.sub_type ==
@@ -2675,8 +2676,8 @@ int parse_flow_flow_action(struct mlx5_ib_flow_action *maction,
 				return -EINVAL;
 			action->action |=
 				MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT;
-			action->reformat_id =
-				maction->flow_action_raw.action_id;
+			action->pkt_reformat =
+				maction->flow_action_raw.pkt_reformat;
 			return 0;
 		}
 		/* fall through */
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index a20d2ee08a3b..125a507c10ed 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -868,7 +868,10 @@ struct mlx5_ib_flow_action {
 		struct {
 			struct mlx5_ib_dev *dev;
 			u32 sub_type;
-			u32 action_id;
+			union {
+				struct mlx5_modify_hdr *modify_hdr;
+				struct mlx5_pkt_reformat *pkt_reformat;
+			};
 		} flow_action_raw;
 	};
 };
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
index 4c4620db3d31..f8ee18b4da6f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
@@ -291,14 +291,14 @@ int mlx5e_tc_tun_create_header_ipv4(struct mlx5e_priv *priv,
 		 */
 		goto out;
 	}
-
-	err = mlx5_packet_reformat_alloc(priv->mdev,
-					 e->reformat_type,
-					 ipv4_encap_size, encap_header,
-					 MLX5_FLOW_NAMESPACE_FDB,
-					 &e->encap_id);
-	if (err)
+	e->pkt_reformat = mlx5_packet_reformat_alloc(priv->mdev,
+						     e->reformat_type,
+						     ipv4_encap_size, encap_header,
+						     MLX5_FLOW_NAMESPACE_FDB);
+	if (IS_ERR(e->pkt_reformat)) {
+		err = PTR_ERR(e->pkt_reformat);
 		goto destroy_neigh_entry;
+	}
 
 	e->flags |= MLX5_ENCAP_ENTRY_VALID;
 	mlx5e_rep_queue_neigh_stats_work(netdev_priv(out_dev));
@@ -407,13 +407,14 @@ int mlx5e_tc_tun_create_header_ipv6(struct mlx5e_priv *priv,
 		goto out;
 	}
 
-	err = mlx5_packet_reformat_alloc(priv->mdev,
-					 e->reformat_type,
-					 ipv6_encap_size, encap_header,
-					 MLX5_FLOW_NAMESPACE_FDB,
-					 &e->encap_id);
-	if (err)
+	e->pkt_reformat = mlx5_packet_reformat_alloc(priv->mdev,
+						     e->reformat_type,
+						     ipv6_encap_size, encap_header,
+						     MLX5_FLOW_NAMESPACE_FDB);
+	if (IS_ERR(e->pkt_reformat)) {
+		err = PTR_ERR(e->pkt_reformat);
 		goto destroy_neigh_entry;
+	}
 
 	e->flags |= MLX5_ENCAP_ENTRY_VALID;
 	mlx5e_rep_queue_neigh_stats_work(netdev_priv(out_dev));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
index a0ae5069d8c3..8e512216deb8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
@@ -161,7 +161,7 @@ struct mlx5e_encap_entry {
 	 */
 	struct hlist_node encap_hlist;
 	struct list_head flows;
-	u32 encap_id;
+	struct mlx5_pkt_reformat *pkt_reformat;
 	const struct ip_tunnel_info *tun_info;
 	unsigned char h_dest[ETH_ALEN];	/* destination eth addr	*/
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 67f66412a33c..30d26eba75a3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -61,7 +61,7 @@
 struct mlx5_nic_flow_attr {
 	u32 action;
 	u32 flow_tag;
-	u32 mod_hdr_id;
+	struct mlx5_modify_hdr *modify_hdr;
 	u32 hairpin_tirn;
 	u8 match_level;
 	struct mlx5_flow_table	*hairpin_ft;
@@ -201,7 +201,7 @@ struct mlx5e_mod_hdr_entry {
 
 	struct mod_hdr_key key;
 
-	u32 mod_hdr_id;
+	struct mlx5_modify_hdr *modify_hdr;
 
 	refcount_t refcnt;
 	struct completion res_ready;
@@ -334,7 +334,7 @@ static void mlx5e_mod_hdr_put(struct mlx5e_priv *priv,
 
 	WARN_ON(!list_empty(&mh->flows));
 	if (mh->compl_result > 0)
-		mlx5_modify_header_dealloc(priv->mdev, mh->mod_hdr_id);
+		mlx5_modify_header_dealloc(priv->mdev, mh->modify_hdr);
 
 	kfree(mh);
 }
@@ -395,11 +395,11 @@ static int mlx5e_attach_mod_hdr(struct mlx5e_priv *priv,
 	hash_add(tbl->hlist, &mh->mod_hdr_hlist, hash_key);
 	mutex_unlock(&tbl->lock);
 
-	err = mlx5_modify_header_alloc(priv->mdev, namespace,
-				       mh->key.num_actions,
-				       mh->key.actions,
-				       &mh->mod_hdr_id);
-	if (err) {
+	mh->modify_hdr = mlx5_modify_header_alloc(priv->mdev, namespace,
+						  mh->key.num_actions,
+						  mh->key.actions);
+	if (IS_ERR(mh->modify_hdr)) {
+		err = PTR_ERR(mh->modify_hdr);
 		mh->compl_result = err;
 		goto alloc_header_err;
 	}
@@ -412,9 +412,9 @@ static int mlx5e_attach_mod_hdr(struct mlx5e_priv *priv,
 	list_add(&flow->mod_hdr, &mh->flows);
 	spin_unlock(&mh->flows_lock);
 	if (mlx5e_is_eswitch_flow(flow))
-		flow->esw_attr->mod_hdr_id = mh->mod_hdr_id;
+		flow->esw_attr->modify_hdr = mh->modify_hdr;
 	else
-		flow->nic_attr->mod_hdr_id = mh->mod_hdr_id;
+		flow->nic_attr->modify_hdr = mh->modify_hdr;
 
 	return 0;
 
@@ -906,7 +906,6 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
 	struct mlx5_flow_destination dest[2] = {};
 	struct mlx5_flow_act flow_act = {
 		.action = attr->action,
-		.reformat_id = 0,
 		.flags    = FLOW_ACT_NO_APPEND,
 	};
 	struct mlx5_fc *counter = NULL;
@@ -947,7 +946,7 @@ mlx5e_tc_add_nic_flow(struct mlx5e_priv *priv,
 
 	if (attr->action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR) {
 		err = mlx5e_attach_mod_hdr(priv, flow, parse_attr);
-		flow_act.modify_id = attr->mod_hdr_id;
+		flow_act.modify_hdr = attr->modify_hdr;
 		kfree(parse_attr->mod_hdr_actions);
 		if (err)
 			return err;
@@ -1304,14 +1303,13 @@ void mlx5e_tc_encap_flows_add(struct mlx5e_priv *priv,
 	struct mlx5e_tc_flow *flow;
 	int err;
 
-	err = mlx5_packet_reformat_alloc(priv->mdev,
-					 e->reformat_type,
-					 e->encap_size, e->encap_header,
-					 MLX5_FLOW_NAMESPACE_FDB,
-					 &e->encap_id);
-	if (err) {
-		mlx5_core_warn(priv->mdev, "Failed to offload cached encapsulation header, %d\n",
-			       err);
+	e->pkt_reformat = mlx5_packet_reformat_alloc(priv->mdev,
+						     e->reformat_type,
+						     e->encap_size, e->encap_header,
+						     MLX5_FLOW_NAMESPACE_FDB);
+	if (IS_ERR(e->pkt_reformat)) {
+		mlx5_core_warn(priv->mdev, "Failed to offload cached encapsulation header, %ld\n",
+			       PTR_ERR(e->pkt_reformat));
 		return;
 	}
 	e->flags |= MLX5_ENCAP_ENTRY_VALID;
@@ -1326,7 +1324,7 @@ void mlx5e_tc_encap_flows_add(struct mlx5e_priv *priv,
 		esw_attr = flow->esw_attr;
 		spec = &esw_attr->parse_attr->spec;
 
-		esw_attr->dests[flow->tmp_efi_index].encap_id = e->encap_id;
+		esw_attr->dests[flow->tmp_efi_index].pkt_reformat = e->pkt_reformat;
 		esw_attr->dests[flow->tmp_efi_index].flags |= MLX5_ESW_DEST_ENCAP_VALID;
 		/* Flow can be associated with multiple encap entries.
 		 * Before offloading the flow verify that all of them have
@@ -1395,7 +1393,7 @@ void mlx5e_tc_encap_flows_del(struct mlx5e_priv *priv,
 
 	/* we know that the encap is valid */
 	e->flags &= ~MLX5_ENCAP_ENTRY_VALID;
-	mlx5_packet_reformat_dealloc(priv->mdev, e->encap_id);
+	mlx5_packet_reformat_dealloc(priv->mdev, e->pkt_reformat);
 }
 
 static struct mlx5_fc *mlx5e_tc_get_counter(struct mlx5e_tc_flow *flow)
@@ -1561,7 +1559,7 @@ static void mlx5e_encap_dealloc(struct mlx5e_priv *priv, struct mlx5e_encap_entr
 		mlx5e_rep_encap_entry_detach(netdev_priv(e->out_dev), e);
 
 		if (e->flags & MLX5_ENCAP_ENTRY_VALID)
-			mlx5_packet_reformat_dealloc(priv->mdev, e->encap_id);
+			mlx5_packet_reformat_dealloc(priv->mdev, e->pkt_reformat);
 	}
 
 	kfree(e->encap_header);
@@ -3048,7 +3046,7 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
 	flow->encaps[out_index].index = out_index;
 	*encap_dev = e->out_dev;
 	if (e->flags & MLX5_ENCAP_ENTRY_VALID) {
-		attr->dests[out_index].encap_id = e->encap_id;
+		attr->dests[out_index].pkt_reformat = e->pkt_reformat;
 		attr->dests[out_index].flags |= MLX5_ESW_DEST_ENCAP_VALID;
 		*encap_valid = true;
 	} else {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index aba9e7a6ad3c..4f70202db6af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -69,7 +69,7 @@ struct vport_ingress {
 	struct mlx5_flow_group *allow_spoofchk_only_grp;
 	struct mlx5_flow_group *allow_untagged_only_grp;
 	struct mlx5_flow_group *drop_grp;
-	int modify_metadata_id;
+	struct mlx5_modify_hdr   *modify_metadata;
 	struct mlx5_flow_handle  *modify_metadata_rule;
 	struct mlx5_flow_handle  *allow_rule;
 	struct mlx5_flow_handle  *drop_rule;
@@ -385,11 +385,11 @@ struct mlx5_esw_flow_attr {
 	struct {
 		u32 flags;
 		struct mlx5_eswitch_rep *rep;
+		struct mlx5_pkt_reformat *pkt_reformat;
 		struct mlx5_core_dev *mdev;
-		u32 encap_id;
 		struct mlx5_termtbl_handle *termtbl;
 	} dests[MLX5_MAX_FLOW_FWD_VPORTS];
-	u32	mod_hdr_id;
+	struct  mlx5_modify_hdr *modify_hdr;
 	u8	inner_match_level;
 	u8	outer_match_level;
 	struct mlx5_fc *counter;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 7d3582ee66b7..bee67ff58137 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -190,10 +190,10 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 						MLX5_FLOW_DEST_VPORT_VHCA_ID;
 				if (attr->dests[j].flags & MLX5_ESW_DEST_ENCAP) {
 					flow_act.action |= MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT;
-					flow_act.reformat_id = attr->dests[j].encap_id;
+					flow_act.pkt_reformat = attr->dests[j].pkt_reformat;
 					dest[i].vport.flags |= MLX5_FLOW_DEST_VPORT_REFORMAT_ID;
-					dest[i].vport.reformat_id =
-						attr->dests[j].encap_id;
+					dest[i].vport.pkt_reformat =
+						attr->dests[j].pkt_reformat;
 				}
 				i++;
 			}
@@ -213,7 +213,7 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 		spec->match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
 
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
-		flow_act.modify_id = attr->mod_hdr_id;
+		flow_act.modify_hdr = attr->modify_hdr;
 
 	fdb = esw_get_prio_table(esw, attr->chain, attr->prio, !!split);
 	if (IS_ERR(fdb)) {
@@ -276,7 +276,7 @@ mlx5_eswitch_add_fwd_rule(struct mlx5_eswitch *esw,
 			dest[i].vport.flags |= MLX5_FLOW_DEST_VPORT_VHCA_ID;
 		if (attr->dests[i].flags & MLX5_ESW_DEST_ENCAP) {
 			dest[i].vport.flags |= MLX5_FLOW_DEST_VPORT_REFORMAT_ID;
-			dest[i].vport.reformat_id = attr->dests[i].encap_id;
+			dest[i].vport.pkt_reformat = attr->dests[i].pkt_reformat;
 		}
 	}
 	dest[i].type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
@@ -1734,7 +1734,7 @@ static int esw_vport_ingress_prio_tag_config(struct mlx5_eswitch *esw,
 
 	if (vport->ingress.modify_metadata_rule) {
 		flow_act.action |= MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
-		flow_act.modify_id = vport->ingress.modify_metadata_id;
+		flow_act.modify_hdr = vport->ingress.modify_metadata;
 	}
 
 	vport->ingress.allow_rule =
@@ -1770,9 +1770,11 @@ static int esw_vport_add_ingress_acl_modify_metadata(struct mlx5_eswitch *esw,
 	MLX5_SET(set_action_in, action, data,
 		 mlx5_eswitch_get_vport_metadata_for_match(esw, vport->vport));
 
-	err = mlx5_modify_header_alloc(esw->dev, MLX5_FLOW_NAMESPACE_ESW_INGRESS,
-				       1, action, &vport->ingress.modify_metadata_id);
-	if (err) {
+	vport->ingress.modify_metadata =
+		mlx5_modify_header_alloc(esw->dev, MLX5_FLOW_NAMESPACE_ESW_INGRESS,
+					 1, action);
+	if (IS_ERR(vport->ingress.modify_metadata)) {
+		err = PTR_ERR(vport->ingress.modify_metadata);
 		esw_warn(esw->dev,
 			 "failed to alloc modify header for vport %d ingress acl (%d)\n",
 			 vport->vport, err);
@@ -1780,7 +1782,7 @@ static int esw_vport_add_ingress_acl_modify_metadata(struct mlx5_eswitch *esw,
 	}
 
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_MOD_HDR | MLX5_FLOW_CONTEXT_ACTION_ALLOW;
-	flow_act.modify_id = vport->ingress.modify_metadata_id;
+	flow_act.modify_hdr = vport->ingress.modify_metadata;
 	vport->ingress.modify_metadata_rule = mlx5_add_flow_rules(vport->ingress.acl,
 								  &spec, &flow_act, NULL, 0);
 	if (IS_ERR(vport->ingress.modify_metadata_rule)) {
@@ -1794,7 +1796,7 @@ static int esw_vport_add_ingress_acl_modify_metadata(struct mlx5_eswitch *esw,
 
 out:
 	if (err)
-		mlx5_modify_header_dealloc(esw->dev, vport->ingress.modify_metadata_id);
+		mlx5_modify_header_dealloc(esw->dev, vport->ingress.modify_metadata);
 	return err;
 }
 
@@ -1803,7 +1805,7 @@ void esw_vport_del_ingress_acl_modify_metadata(struct mlx5_eswitch *esw,
 {
 	if (vport->ingress.modify_metadata_rule) {
 		mlx5_del_flow_rules(vport->ingress.modify_metadata_rule);
-		mlx5_modify_header_dealloc(esw->dev, vport->ingress.modify_metadata_id);
+		mlx5_modify_header_dealloc(esw->dev, vport->ingress.modify_metadata);
 
 		vport->ingress.modify_metadata_rule = NULL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 1e3381604b3d..488f50dfb404 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -107,6 +107,34 @@ static int mlx5_cmd_stub_delete_fte(struct mlx5_flow_root_namespace *ns,
 	return 0;
 }
 
+static int mlx5_cmd_stub_packet_reformat_alloc(struct mlx5_flow_root_namespace *ns,
+					       int reformat_type,
+					       size_t size,
+					       void *reformat_data,
+					       enum mlx5_flow_namespace_type namespace,
+					       struct mlx5_pkt_reformat *pkt_reformat)
+{
+	return 0;
+}
+
+static void mlx5_cmd_stub_packet_reformat_dealloc(struct mlx5_flow_root_namespace *ns,
+						  struct mlx5_pkt_reformat *pkt_reformat)
+{
+}
+
+static int mlx5_cmd_stub_modify_header_alloc(struct mlx5_flow_root_namespace *ns,
+					     u8 namespace, u8 num_actions,
+					     void *modify_actions,
+					     struct mlx5_modify_hdr *modify_hdr)
+{
+	return 0;
+}
+
+static void mlx5_cmd_stub_modify_header_dealloc(struct mlx5_flow_root_namespace *ns,
+						struct mlx5_modify_hdr *modify_hdr)
+{
+}
+
 static int mlx5_cmd_update_root_ft(struct mlx5_flow_root_namespace *ns,
 				   struct mlx5_flow_table *ft, u32 underlay_qpn,
 				   bool disconnect)
@@ -412,11 +440,13 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 	} else {
 		MLX5_SET(flow_context, in_flow_context, action,
 			 fte->action.action);
-		MLX5_SET(flow_context, in_flow_context, packet_reformat_id,
-			 fte->action.reformat_id);
+		if (fte->action.pkt_reformat)
+			MLX5_SET(flow_context, in_flow_context, packet_reformat_id,
+				 fte->action.pkt_reformat->id);
 	}
-	MLX5_SET(flow_context, in_flow_context, modify_header_id,
-		 fte->action.modify_id);
+	if (fte->action.modify_hdr)
+		MLX5_SET(flow_context, in_flow_context, modify_header_id,
+			 fte->action.modify_hdr->id);
 
 	vlan = MLX5_ADDR_OF(flow_context, in_flow_context, push_vlan);
 
@@ -468,7 +498,7 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 						    MLX5_FLOW_DEST_VPORT_REFORMAT_ID));
 					MLX5_SET(extended_dest_format, in_dests,
 						 packet_reformat_id,
-						 dst->dest_attr.vport.reformat_id);
+						 dst->dest_attr.vport.pkt_reformat->id);
 				}
 				break;
 			default:
@@ -643,14 +673,15 @@ int mlx5_cmd_fc_bulk_query(struct mlx5_core_dev *dev, u32 base_id, int bulk_len,
 	return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
 }
 
-int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
-			       int reformat_type,
-			       size_t size,
-			       void *reformat_data,
-			       enum mlx5_flow_namespace_type namespace,
-			       u32 *packet_reformat_id)
+static int mlx5_cmd_packet_reformat_alloc(struct mlx5_flow_root_namespace *ns,
+					  int reformat_type,
+					  size_t size,
+					  void *reformat_data,
+					  enum mlx5_flow_namespace_type namespace,
+					  struct mlx5_pkt_reformat *pkt_reformat)
 {
 	u32 out[MLX5_ST_SZ_DW(alloc_packet_reformat_context_out)];
+	struct mlx5_core_dev *dev = ns->dev;
 	void *packet_reformat_context_in;
 	int max_encap_size;
 	void *reformat;
@@ -693,35 +724,36 @@ int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
 	memset(out, 0, sizeof(out));
 	err = mlx5_cmd_exec(dev, in, inlen, out, sizeof(out));
 
-	*packet_reformat_id = MLX5_GET(alloc_packet_reformat_context_out,
-				       out, packet_reformat_id);
+	pkt_reformat->id = MLX5_GET(alloc_packet_reformat_context_out,
+				    out, packet_reformat_id);
 	kfree(in);
 	return err;
 }
-EXPORT_SYMBOL(mlx5_packet_reformat_alloc);
 
-void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
-				  u32 packet_reformat_id)
+static void mlx5_cmd_packet_reformat_dealloc(struct mlx5_flow_root_namespace *ns,
+					     struct mlx5_pkt_reformat *pkt_reformat)
 {
 	u32 in[MLX5_ST_SZ_DW(dealloc_packet_reformat_context_in)];
 	u32 out[MLX5_ST_SZ_DW(dealloc_packet_reformat_context_out)];
+	struct mlx5_core_dev *dev = ns->dev;
 
 	memset(in, 0, sizeof(in));
 	MLX5_SET(dealloc_packet_reformat_context_in, in, opcode,
 		 MLX5_CMD_OP_DEALLOC_PACKET_REFORMAT_CONTEXT);
 	MLX5_SET(dealloc_packet_reformat_context_in, in, packet_reformat_id,
-		 packet_reformat_id);
+		 pkt_reformat->id);
 
 	mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
-EXPORT_SYMBOL(mlx5_packet_reformat_dealloc);
 
-int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
-			     u8 namespace, u8 num_actions,
-			     void *modify_actions, u32 *modify_header_id)
+static int mlx5_cmd_modify_header_alloc(struct mlx5_flow_root_namespace *ns,
+					u8 namespace, u8 num_actions,
+					void *modify_actions,
+					struct mlx5_modify_hdr *modify_hdr)
 {
 	u32 out[MLX5_ST_SZ_DW(alloc_modify_header_context_out)];
 	int max_actions, actions_size, inlen, err;
+	struct mlx5_core_dev *dev = ns->dev;
 	void *actions_in;
 	u8 table_type;
 	u32 *in;
@@ -772,26 +804,26 @@ int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
 	memset(out, 0, sizeof(out));
 	err = mlx5_cmd_exec(dev, in, inlen, out, sizeof(out));
 
-	*modify_header_id = MLX5_GET(alloc_modify_header_context_out, out, modify_header_id);
+	modify_hdr->id = MLX5_GET(alloc_modify_header_context_out, out, modify_header_id);
 	kfree(in);
 	return err;
 }
-EXPORT_SYMBOL(mlx5_modify_header_alloc);
 
-void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev, u32 modify_header_id)
+static void mlx5_cmd_modify_header_dealloc(struct mlx5_flow_root_namespace *ns,
+					   struct mlx5_modify_hdr *modify_hdr)
 {
 	u32 in[MLX5_ST_SZ_DW(dealloc_modify_header_context_in)];
 	u32 out[MLX5_ST_SZ_DW(dealloc_modify_header_context_out)];
+	struct mlx5_core_dev *dev = ns->dev;
 
 	memset(in, 0, sizeof(in));
 	MLX5_SET(dealloc_modify_header_context_in, in, opcode,
 		 MLX5_CMD_OP_DEALLOC_MODIFY_HEADER_CONTEXT);
 	MLX5_SET(dealloc_modify_header_context_in, in, modify_header_id,
-		 modify_header_id);
+		 modify_hdr->id);
 
 	mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
-EXPORT_SYMBOL(mlx5_modify_header_dealloc);
 
 static const struct mlx5_flow_cmds mlx5_flow_cmds = {
 	.create_flow_table = mlx5_cmd_create_flow_table,
@@ -803,6 +835,10 @@ static const struct mlx5_flow_cmds mlx5_flow_cmds = {
 	.update_fte = mlx5_cmd_update_fte,
 	.delete_fte = mlx5_cmd_delete_fte,
 	.update_root_ft = mlx5_cmd_update_root_ft,
+	.packet_reformat_alloc = mlx5_cmd_packet_reformat_alloc,
+	.packet_reformat_dealloc = mlx5_cmd_packet_reformat_dealloc,
+	.modify_header_alloc = mlx5_cmd_modify_header_alloc,
+	.modify_header_dealloc = mlx5_cmd_modify_header_dealloc
 };
 
 static const struct mlx5_flow_cmds mlx5_flow_cmd_stubs = {
@@ -815,6 +851,10 @@ static const struct mlx5_flow_cmds mlx5_flow_cmd_stubs = {
 	.update_fte = mlx5_cmd_stub_update_fte,
 	.delete_fte = mlx5_cmd_stub_delete_fte,
 	.update_root_ft = mlx5_cmd_stub_update_root_ft,
+	.packet_reformat_alloc = mlx5_cmd_stub_packet_reformat_alloc,
+	.packet_reformat_dealloc = mlx5_cmd_stub_packet_reformat_dealloc,
+	.modify_header_alloc = mlx5_cmd_stub_modify_header_alloc,
+	.modify_header_dealloc = mlx5_cmd_stub_modify_header_dealloc
 };
 
 static const struct mlx5_flow_cmds *mlx5_fs_cmd_get_fw_cmds(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
index bc4606306009..3268654d6748 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
@@ -75,6 +75,24 @@ struct mlx5_flow_cmds {
 			      struct mlx5_flow_table *ft,
 			      u32 underlay_qpn,
 			      bool disconnect);
+
+	int (*packet_reformat_alloc)(struct mlx5_flow_root_namespace *ns,
+				     int reformat_type,
+				     size_t size,
+				     void *reformat_data,
+				     enum mlx5_flow_namespace_type namespace,
+				     struct mlx5_pkt_reformat *pkt_reformat);
+
+	void (*packet_reformat_dealloc)(struct mlx5_flow_root_namespace *ns,
+					struct mlx5_pkt_reformat *pkt_reformat);
+
+	int (*modify_header_alloc)(struct mlx5_flow_root_namespace *ns,
+				   u8 namespace, u8 num_actions,
+				   void *modify_actions,
+				   struct mlx5_modify_hdr *modify_hdr);
+
+	void (*modify_header_dealloc)(struct mlx5_flow_root_namespace *ns,
+				      struct mlx5_modify_hdr *modify_hdr);
 };
 
 int mlx5_cmd_fc_alloc(struct mlx5_core_dev *dev, u32 *id);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 7bdec442f0ac..1d2333fd3080 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1415,7 +1415,8 @@ static bool mlx5_flow_dests_cmp(struct mlx5_flow_destination *d1,
 		     ((d1->vport.flags & MLX5_FLOW_DEST_VPORT_VHCA_ID) ?
 		      (d1->vport.vhca_id == d2->vport.vhca_id) : true) &&
 		     ((d1->vport.flags & MLX5_FLOW_DEST_VPORT_REFORMAT_ID) ?
-		      (d1->vport.reformat_id == d2->vport.reformat_id) : true)) ||
+		      (d1->vport.pkt_reformat->id ==
+		       d2->vport.pkt_reformat->id) : true)) ||
 		    (d1->type == MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE &&
 		     d1->ft == d2->ft) ||
 		    (d1->type == MLX5_FLOW_DESTINATION_TYPE_TIR &&
@@ -2888,3 +2889,105 @@ int mlx5_fs_remove_rx_underlay_qpn(struct mlx5_core_dev *dev, u32 underlay_qpn)
 	return err;
 }
 EXPORT_SYMBOL(mlx5_fs_remove_rx_underlay_qpn);
+
+static struct mlx5_flow_root_namespace
+*get_root_namespace(struct mlx5_core_dev *dev, enum mlx5_flow_namespace_type ns_type)
+{
+	struct mlx5_flow_namespace *ns;
+
+	if (ns_type == MLX5_FLOW_NAMESPACE_ESW_EGRESS ||
+	    ns_type == MLX5_FLOW_NAMESPACE_ESW_INGRESS)
+		ns = mlx5_get_flow_vport_acl_namespace(dev, ns_type, 0);
+	else
+		ns = mlx5_get_flow_namespace(dev, ns_type);
+	if (!ns)
+		return NULL;
+
+	return find_root(&ns->node);
+}
+
+struct mlx5_modify_hdr *mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
+						 u8 ns_type, u8 num_actions,
+						 void *modify_actions)
+{
+	struct mlx5_flow_root_namespace *root;
+	struct mlx5_modify_hdr *modify_hdr;
+	int err;
+
+	root = get_root_namespace(dev, ns_type);
+	if (!root)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	modify_hdr = kzalloc(sizeof(*modify_hdr), GFP_KERNEL);
+	if (!modify_hdr)
+		return ERR_PTR(-ENOMEM);
+
+	modify_hdr->ns_type = ns_type;
+	err = root->cmds->modify_header_alloc(root, ns_type, num_actions,
+					      modify_actions, modify_hdr);
+	if (err) {
+		kfree(modify_hdr);
+		return ERR_PTR(err);
+	}
+
+	return modify_hdr;
+}
+EXPORT_SYMBOL(mlx5_modify_header_alloc);
+
+void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev,
+				struct mlx5_modify_hdr *modify_hdr)
+{
+	struct mlx5_flow_root_namespace *root;
+
+	root = get_root_namespace(dev, modify_hdr->ns_type);
+	if (WARN_ON(!root))
+		return;
+	root->cmds->modify_header_dealloc(root, modify_hdr);
+	kfree(modify_hdr);
+}
+EXPORT_SYMBOL(mlx5_modify_header_dealloc);
+
+struct mlx5_pkt_reformat *mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
+						     int reformat_type,
+						     size_t size,
+						     void *reformat_data,
+						     enum mlx5_flow_namespace_type ns_type)
+{
+	struct mlx5_pkt_reformat *pkt_reformat;
+	struct mlx5_flow_root_namespace *root;
+	int err;
+
+	root = get_root_namespace(dev, ns_type);
+	if (!root)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	pkt_reformat = kzalloc(sizeof(*pkt_reformat), GFP_KERNEL);
+	if (!pkt_reformat)
+		return ERR_PTR(-ENOMEM);
+
+	pkt_reformat->ns_type = ns_type;
+	pkt_reformat->reformat_type = reformat_type;
+	err = root->cmds->packet_reformat_alloc(root, reformat_type, size,
+						reformat_data, ns_type,
+						pkt_reformat);
+	if (err) {
+		kfree(pkt_reformat);
+		return ERR_PTR(err);
+	}
+
+	return pkt_reformat;
+}
+EXPORT_SYMBOL(mlx5_packet_reformat_alloc);
+
+void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
+				  struct mlx5_pkt_reformat *pkt_reformat)
+{
+	struct mlx5_flow_root_namespace *root;
+
+	root = get_root_namespace(dev, pkt_reformat->ns_type);
+	if (WARN_ON(!root))
+		return;
+	root->cmds->packet_reformat_dealloc(root, pkt_reformat);
+	kfree(pkt_reformat);
+}
+EXPORT_SYMBOL(mlx5_packet_reformat_dealloc);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index 0d16b4b5ab83..ea0f221685ab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -38,6 +38,17 @@
 #include <linux/rhashtable.h>
 #include <linux/llist.h>
 
+struct mlx5_modify_hdr {
+	enum mlx5_flow_namespace_type ns_type;
+	u32 id;
+};
+
+struct mlx5_pkt_reformat {
+	enum mlx5_flow_namespace_type ns_type;
+	int reformat_type; /* from mlx5_ifc */
+	u32 id;
+};
+
 /* FS_TYPE_PRIO_CHAINS is a PRIO that will have namespaces only,
  * and those are in parallel to one another when going over them to connect
  * a new flow table. Meaning the last flow table in a TYPE_PRIO prio in one
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 97ec6be62ac4..724d276ea133 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -84,6 +84,8 @@ enum {
 	FDB_SLOW_PATH,
 };
 
+struct mlx5_pkt_reformat;
+struct mlx5_modify_hdr;
 struct mlx5_flow_table;
 struct mlx5_flow_group;
 struct mlx5_flow_namespace;
@@ -121,7 +123,7 @@ struct mlx5_flow_destination {
 		struct {
 			u16		num;
 			u16		vhca_id;
-			u32		reformat_id;
+			struct mlx5_pkt_reformat *pkt_reformat;
 			u8		flags;
 		} vport;
 	};
@@ -195,8 +197,8 @@ enum {
 
 struct mlx5_flow_act {
 	u32 action;
-	u32 reformat_id;
-	u32 modify_id;
+	struct mlx5_modify_hdr  *modify_hdr;
+	struct mlx5_pkt_reformat *pkt_reformat;
 	uintptr_t esp_id;
 	u32 flags;
 	struct mlx5_fs_vlan vlan[MLX5_FS_VLAN_DEPTH];
@@ -205,8 +207,6 @@ struct mlx5_flow_act {
 
 #define MLX5_DECLARE_FLOW_ACT(name) \
 	struct mlx5_flow_act name = { .action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,\
-				      .reformat_id = 0, \
-				      .modify_id = 0, \
 				      .flags =  0, }
 
 /* Single destination per rule.
@@ -236,19 +236,18 @@ u32 mlx5_fc_id(struct mlx5_fc *counter);
 int mlx5_fs_add_rx_underlay_qpn(struct mlx5_core_dev *dev, u32 underlay_qpn);
 int mlx5_fs_remove_rx_underlay_qpn(struct mlx5_core_dev *dev, u32 underlay_qpn);
 
-int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
-			     u8 namespace, u8 num_actions,
-			     void *modify_actions, u32 *modify_header_id);
+struct mlx5_modify_hdr *mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
+						 u8 ns_type, u8 num_actions,
+						 void *modify_actions);
 void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev,
-				u32 modify_header_id);
-
-int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
-			       int reformat_type,
-			       size_t size,
-			       void *reformat_data,
-			       enum mlx5_flow_namespace_type namespace,
-			       u32 *packet_reformat_id);
+				struct mlx5_modify_hdr *modify_hdr);
+
+struct mlx5_pkt_reformat *mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
+						     int reformat_type,
+						     size_t size,
+						     void *reformat_data,
+						     enum mlx5_flow_namespace_type ns_type);
 void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
-				  u32 packet_reformat_id);
+				  struct mlx5_pkt_reformat *reformat);
 
 #endif
-- 
2.21.0



* [net-next 02/18] net/mlx5: DR, Add the internal direct rule types definitions
From: Saeed Mahameed @ 2019-09-02  7:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Alex Vesker, Erez Shitrit,
	Yevgeny Kliteynik, Mark Bloch, Saeed Mahameed
In-Reply-To: <20190902072213.7683-1-saeedm@mellanox.com>

From: Alex Vesker <valex@mellanox.com>

Add the internal header file that contains the various type
definitions that will be used in coming patches, as well as
the internal function declarations.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/steering/dr_types.h    | 1060 +++++++++++++++++
 1 file changed, 1060 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_types.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_types.h b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_types.h
new file mode 100644
index 000000000000..a37ee6359be2
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_types.h
@@ -0,0 +1,1060 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2019, Mellanox Technologies */
+
+#ifndef	_DR_TYPES_
+#define	_DR_TYPES_
+
+#include <linux/mlx5/driver.h>
+#include <linux/refcount.h>
+#include "fs_core.h"
+#include "wq.h"
+#include "lib/mlx5.h"
+#include "mlx5_ifc_dr.h"
+#include "mlx5dr.h"
+
+#define DR_RULE_MAX_STES 17
+#define DR_ACTION_MAX_STES 5
+#define WIRE_PORT 0xFFFF
+#define DR_STE_SVLAN 0x1
+#define DR_STE_CVLAN 0x2
+
+#define mlx5dr_err(dmn, arg...) mlx5_core_err((dmn)->mdev, ##arg)
+#define mlx5dr_info(dmn, arg...) mlx5_core_info((dmn)->mdev, ##arg)
+#define mlx5dr_dbg(dmn, arg...) mlx5_core_dbg((dmn)->mdev, ##arg)
+
+enum mlx5dr_icm_chunk_size {
+	DR_CHUNK_SIZE_1,
+	DR_CHUNK_SIZE_MIN = DR_CHUNK_SIZE_1, /* keep updated when changing */
+	DR_CHUNK_SIZE_2,
+	DR_CHUNK_SIZE_4,
+	DR_CHUNK_SIZE_8,
+	DR_CHUNK_SIZE_16,
+	DR_CHUNK_SIZE_32,
+	DR_CHUNK_SIZE_64,
+	DR_CHUNK_SIZE_128,
+	DR_CHUNK_SIZE_256,
+	DR_CHUNK_SIZE_512,
+	DR_CHUNK_SIZE_1K,
+	DR_CHUNK_SIZE_2K,
+	DR_CHUNK_SIZE_4K,
+	DR_CHUNK_SIZE_8K,
+	DR_CHUNK_SIZE_16K,
+	DR_CHUNK_SIZE_32K,
+	DR_CHUNK_SIZE_64K,
+	DR_CHUNK_SIZE_128K,
+	DR_CHUNK_SIZE_256K,
+	DR_CHUNK_SIZE_512K,
+	DR_CHUNK_SIZE_1024K,
+	DR_CHUNK_SIZE_2048K,
+	DR_CHUNK_SIZE_MAX,
+};
+
+enum mlx5dr_icm_type {
+	DR_ICM_TYPE_STE,
+	DR_ICM_TYPE_MODIFY_ACTION,
+};
+
+static inline enum mlx5dr_icm_chunk_size
+mlx5dr_icm_next_higher_chunk(enum mlx5dr_icm_chunk_size chunk)
+{
+	chunk += 2;
+	if (chunk < DR_CHUNK_SIZE_MAX)
+		return chunk;
+
+	return DR_CHUNK_SIZE_MAX;
+}
+
+enum {
+	DR_STE_SIZE = 64,
+	DR_STE_SIZE_CTRL = 32,
+	DR_STE_SIZE_TAG = 16,
+	DR_STE_SIZE_MASK = 16,
+};
+
+enum {
+	DR_STE_SIZE_REDUCED = DR_STE_SIZE - DR_STE_SIZE_MASK,
+};
+
+enum {
+	DR_MODIFY_ACTION_SIZE = 8,
+};
+
+enum mlx5dr_matcher_criteria {
+	DR_MATCHER_CRITERIA_EMPTY = 0,
+	DR_MATCHER_CRITERIA_OUTER = 1 << 0,
+	DR_MATCHER_CRITERIA_MISC = 1 << 1,
+	DR_MATCHER_CRITERIA_INNER = 1 << 2,
+	DR_MATCHER_CRITERIA_MISC2 = 1 << 3,
+	DR_MATCHER_CRITERIA_MISC3 = 1 << 4,
+	DR_MATCHER_CRITERIA_MAX = 1 << 5,
+};
+
+enum mlx5dr_action_type {
+	DR_ACTION_TYP_TNL_L2_TO_L2,
+	DR_ACTION_TYP_L2_TO_TNL_L2,
+	DR_ACTION_TYP_TNL_L3_TO_L2,
+	DR_ACTION_TYP_L2_TO_TNL_L3,
+	DR_ACTION_TYP_DROP,
+	DR_ACTION_TYP_QP,
+	DR_ACTION_TYP_FT,
+	DR_ACTION_TYP_CTR,
+	DR_ACTION_TYP_TAG,
+	DR_ACTION_TYP_MODIFY_HDR,
+	DR_ACTION_TYP_VPORT,
+	DR_ACTION_TYP_POP_VLAN,
+	DR_ACTION_TYP_PUSH_VLAN,
+	DR_ACTION_TYP_MAX,
+};
+
+struct mlx5dr_icm_pool;
+struct mlx5dr_icm_chunk;
+struct mlx5dr_icm_bucket;
+struct mlx5dr_ste_htbl;
+struct mlx5dr_match_param;
+struct mlx5dr_cmd_caps;
+struct mlx5dr_matcher_rx_tx;
+
+struct mlx5dr_ste {
+	u8 *hw_ste;
+	/* refcount: the number of rules that use this ste */
+	refcount_t refcount;
+
+	/* attached to the miss_list head at each htbl entry */
+	struct list_head miss_list_node;
+
+	/* each rule member that uses this ste is attached here */
+	struct list_head rule_list;
+
+	/* this ste is member of htbl */
+	struct mlx5dr_ste_htbl *htbl;
+
+	struct mlx5dr_ste_htbl *next_htbl;
+
+	/* this ste is part of a rule, located in ste's chain */
+	u8 ste_chain_location;
+};
+
+struct mlx5dr_ste_htbl_ctrl {
+	/* total number of valid entries belonging to this hash table. This
+	 * includes the non collision and collision entries
+	 */
+	unsigned int num_of_valid_entries;
+
+	/* total number of collision entries attached to this table */
+	unsigned int num_of_collisions;
+	unsigned int increase_threshold;
+	u8 may_grow:1;
+};
+
+struct mlx5dr_ste_htbl {
+	u8 lu_type;
+	u16 byte_mask;
+	refcount_t refcount;
+	struct mlx5dr_icm_chunk *chunk;
+	struct mlx5dr_ste *ste_arr;
+	u8 *hw_ste_arr;
+
+	struct list_head *miss_list;
+
+	enum mlx5dr_icm_chunk_size chunk_size;
+	struct mlx5dr_ste *pointing_ste;
+
+	struct mlx5dr_ste_htbl_ctrl ctrl;
+};
+
+struct mlx5dr_ste_send_info {
+	struct mlx5dr_ste *ste;
+	struct list_head send_list;
+	u16 size;
+	u16 offset;
+	u8 data_cont[DR_STE_SIZE];
+	u8 *data;
+};
+
+void mlx5dr_send_fill_and_append_ste_send_info(struct mlx5dr_ste *ste, u16 size,
+					       u16 offset, u8 *data,
+					       struct mlx5dr_ste_send_info *ste_info,
+					       struct list_head *send_list,
+					       bool copy_data);
+
+struct mlx5dr_ste_build {
+	u8 inner:1;
+	u8 rx:1;
+	struct mlx5dr_cmd_caps *caps;
+	u8 lu_type;
+	u16 byte_mask;
+	u8 bit_mask[DR_STE_SIZE_MASK];
+	int (*ste_build_tag_func)(struct mlx5dr_match_param *spec,
+				  struct mlx5dr_ste_build *sb,
+				  u8 *hw_ste_p);
+};
+
+struct mlx5dr_ste_htbl *
+mlx5dr_ste_htbl_alloc(struct mlx5dr_icm_pool *pool,
+		      enum mlx5dr_icm_chunk_size chunk_size,
+		      u8 lu_type, u16 byte_mask);
+
+int mlx5dr_ste_htbl_free(struct mlx5dr_ste_htbl *htbl);
+
+static inline void mlx5dr_htbl_put(struct mlx5dr_ste_htbl *htbl)
+{
+	if (refcount_dec_and_test(&htbl->refcount))
+		mlx5dr_ste_htbl_free(htbl);
+}
+
+static inline void mlx5dr_htbl_get(struct mlx5dr_ste_htbl *htbl)
+{
+	refcount_inc(&htbl->refcount);
+}
+
+/* STE utils */
+u32 mlx5dr_ste_calc_hash_index(u8 *hw_ste_p, struct mlx5dr_ste_htbl *htbl);
+void mlx5dr_ste_init(u8 *hw_ste_p, u8 lu_type, u8 entry_type, u16 gvmi);
+void mlx5dr_ste_always_hit_htbl(struct mlx5dr_ste *ste,
+				struct mlx5dr_ste_htbl *next_htbl);
+void mlx5dr_ste_set_miss_addr(u8 *hw_ste, u64 miss_addr);
+u64 mlx5dr_ste_get_miss_addr(u8 *hw_ste);
+void mlx5dr_ste_set_hit_gvmi(u8 *hw_ste_p, u16 gvmi);
+void mlx5dr_ste_set_hit_addr(u8 *hw_ste, u64 icm_addr, u32 ht_size);
+void mlx5dr_ste_always_miss_addr(struct mlx5dr_ste *ste, u64 miss_addr);
+void mlx5dr_ste_set_bit_mask(u8 *hw_ste_p, u8 *bit_mask);
+bool mlx5dr_ste_not_used_ste(struct mlx5dr_ste *ste);
+bool mlx5dr_ste_is_last_in_rule(struct mlx5dr_matcher_rx_tx *nic_matcher,
+				u8 ste_location);
+void mlx5dr_ste_rx_set_flow_tag(u8 *hw_ste_p, u32 flow_tag);
+void mlx5dr_ste_set_counter_id(u8 *hw_ste_p, u32 ctr_id);
+void mlx5dr_ste_set_tx_encap(void *hw_ste_p, u32 reformat_id,
+			     int size, bool encap_l3);
+void mlx5dr_ste_set_rx_decap(u8 *hw_ste_p);
+void mlx5dr_ste_set_rx_decap_l3(u8 *hw_ste_p, bool vlan);
+void mlx5dr_ste_set_rx_pop_vlan(u8 *hw_ste_p);
+void mlx5dr_ste_set_tx_push_vlan(u8 *hw_ste_p, u32 vlan_tpid_pcp_dei_vid,
+				 bool go_back);
+void mlx5dr_ste_set_entry_type(u8 *hw_ste_p, u8 entry_type);
+u8 mlx5dr_ste_get_entry_type(u8 *hw_ste_p);
+void mlx5dr_ste_set_rewrite_actions(u8 *hw_ste_p, u16 num_of_actions,
+				    u32 re_write_index);
+void mlx5dr_ste_set_go_back_bit(u8 *hw_ste_p);
+u64 mlx5dr_ste_get_icm_addr(struct mlx5dr_ste *ste);
+u64 mlx5dr_ste_get_mr_addr(struct mlx5dr_ste *ste);
+struct list_head *mlx5dr_ste_get_miss_list(struct mlx5dr_ste *ste);
+
+void mlx5dr_ste_free(struct mlx5dr_ste *ste,
+		     struct mlx5dr_matcher *matcher,
+		     struct mlx5dr_matcher_rx_tx *nic_matcher);
+static inline void mlx5dr_ste_put(struct mlx5dr_ste *ste,
+				  struct mlx5dr_matcher *matcher,
+				  struct mlx5dr_matcher_rx_tx *nic_matcher)
+{
+	if (refcount_dec_and_test(&ste->refcount))
+		mlx5dr_ste_free(ste, matcher, nic_matcher);
+}
+
+/* initially 0, increased only when the ste appears in a new rule */
+static inline void mlx5dr_ste_get(struct mlx5dr_ste *ste)
+{
+	refcount_inc(&ste->refcount);
+}
+
+void mlx5dr_ste_set_hit_addr_by_next_htbl(u8 *hw_ste,
+					  struct mlx5dr_ste_htbl *next_htbl);
+bool mlx5dr_ste_equal_tag(void *src, void *dst);
+int mlx5dr_ste_create_next_htbl(struct mlx5dr_matcher *matcher,
+				struct mlx5dr_matcher_rx_tx *nic_matcher,
+				struct mlx5dr_ste *ste,
+				u8 *cur_hw_ste,
+				enum mlx5dr_icm_chunk_size log_table_size);
+
+/* STE build functions */
+int mlx5dr_ste_build_pre_check(struct mlx5dr_domain *dmn,
+			       u8 match_criteria,
+			       struct mlx5dr_match_param *mask,
+			       struct mlx5dr_match_param *value);
+int mlx5dr_ste_build_ste_arr(struct mlx5dr_matcher *matcher,
+			     struct mlx5dr_matcher_rx_tx *nic_matcher,
+			     struct mlx5dr_match_param *value,
+			     u8 *ste_arr);
+int mlx5dr_ste_build_eth_l2_src_des(struct mlx5dr_ste_build *builder,
+				    struct mlx5dr_match_param *mask,
+				    bool inner, bool rx);
+void mlx5dr_ste_build_eth_l3_ipv4_5_tuple(struct mlx5dr_ste_build *sb,
+					  struct mlx5dr_match_param *mask,
+					  bool inner, bool rx);
+void mlx5dr_ste_build_eth_l3_ipv4_misc(struct mlx5dr_ste_build *sb,
+				       struct mlx5dr_match_param *mask,
+				       bool inner, bool rx);
+void mlx5dr_ste_build_eth_l3_ipv6_dst(struct mlx5dr_ste_build *sb,
+				      struct mlx5dr_match_param *mask,
+				      bool inner, bool rx);
+void mlx5dr_ste_build_eth_l3_ipv6_src(struct mlx5dr_ste_build *sb,
+				      struct mlx5dr_match_param *mask,
+				      bool inner, bool rx);
+void mlx5dr_ste_build_eth_l2_src(struct mlx5dr_ste_build *sb,
+				 struct mlx5dr_match_param *mask,
+				 bool inner, bool rx);
+void mlx5dr_ste_build_eth_l2_dst(struct mlx5dr_ste_build *sb,
+				 struct mlx5dr_match_param *mask,
+				 bool inner, bool rx);
+void mlx5dr_ste_build_eth_l2_tnl(struct mlx5dr_ste_build *sb,
+				 struct mlx5dr_match_param *mask,
+				 bool inner, bool rx);
+void mlx5dr_ste_build_ipv6_l3_l4(struct mlx5dr_ste_build *sb,
+				 struct mlx5dr_match_param *mask,
+				 bool inner, bool rx);
+void mlx5dr_ste_build_eth_l4_misc(struct mlx5dr_ste_build *sb,
+				  struct mlx5dr_match_param *mask,
+				  bool inner, bool rx);
+void mlx5dr_ste_build_gre(struct mlx5dr_ste_build *sb,
+			  struct mlx5dr_match_param *mask,
+			  bool inner, bool rx);
+void mlx5dr_ste_build_mpls(struct mlx5dr_ste_build *sb,
+			   struct mlx5dr_match_param *mask,
+			   bool inner, bool rx);
+void mlx5dr_ste_build_flex_parser_0(struct mlx5dr_ste_build *sb,
+				    struct mlx5dr_match_param *mask,
+				    bool inner, bool rx);
+int mlx5dr_ste_build_flex_parser_1(struct mlx5dr_ste_build *sb,
+				   struct mlx5dr_match_param *mask,
+				   struct mlx5dr_cmd_caps *caps,
+				   bool inner, bool rx);
+void mlx5dr_ste_build_flex_parser_tnl(struct mlx5dr_ste_build *sb,
+				      struct mlx5dr_match_param *mask,
+				      bool inner, bool rx);
+void mlx5dr_ste_build_general_purpose(struct mlx5dr_ste_build *sb,
+				      struct mlx5dr_match_param *mask,
+				      bool inner, bool rx);
+void mlx5dr_ste_build_register_0(struct mlx5dr_ste_build *sb,
+				 struct mlx5dr_match_param *mask,
+				 bool inner, bool rx);
+void mlx5dr_ste_build_register_1(struct mlx5dr_ste_build *sb,
+				 struct mlx5dr_match_param *mask,
+				 bool inner, bool rx);
+int mlx5dr_ste_build_src_gvmi_qpn(struct mlx5dr_ste_build *sb,
+				  struct mlx5dr_match_param *mask,
+				  struct mlx5dr_cmd_caps *caps,
+				  bool inner, bool rx);
+void mlx5dr_ste_build_empty_always_hit(struct mlx5dr_ste_build *sb, bool rx);
+
+/* Actions utils */
+int mlx5dr_actions_build_ste_arr(struct mlx5dr_matcher *matcher,
+				 struct mlx5dr_matcher_rx_tx *nic_matcher,
+				 struct mlx5dr_action *actions[],
+				 u32 num_actions,
+				 u8 *ste_arr,
+				 u32 *new_hw_ste_arr_sz);
+
+struct mlx5dr_match_spec {
+	u32 smac_47_16;		/* Source MAC address of incoming packet */
+	/* Incoming packet Ethertype - this is the Ethertype
+	 * following the last VLAN tag of the packet
+	 */
+	u32 ethertype:16;
+	u32 smac_15_0:16;	/* Source MAC address of incoming packet */
+	u32 dmac_47_16;		/* Destination MAC address of incoming packet */
+	/* VLAN ID of first VLAN tag in the incoming packet.
+	 * Valid only when cvlan_tag==1 or svlan_tag==1
+	 */
+	u32 first_vid:12;
+	/* CFI bit of first VLAN tag in the incoming packet.
+	 * Valid only when cvlan_tag==1 or svlan_tag==1
+	 */
+	u32 first_cfi:1;
+	/* Priority of first VLAN tag in the incoming packet.
+	 * Valid only when cvlan_tag==1 or svlan_tag==1
+	 */
+	u32 first_prio:3;
+	u32 dmac_15_0:16;	/* Destination MAC address of incoming packet */
+	/* TCP flags. Bit 0: FIN; Bit 1: SYN; Bit 2: RST; Bit 3: PSH; Bit 4: ACK;
+	 *            Bit 5: URG; Bit 6: ECE; Bit 7: CWR; Bit 8: NS
+	 */
+	u32 tcp_flags:9;
+	u32 ip_version:4;	/* IP version */
+	u32 frag:1;		/* Packet is an IP fragment */
+	/* The first vlan in the packet is s-vlan (0x88a8).
+	 * cvlan_tag and svlan_tag cannot be set together
+	 */
+	u32 svlan_tag:1;
+	/* The first vlan in the packet is c-vlan (0x8100).
+	 * cvlan_tag and svlan_tag cannot be set together
+	 */
+	u32 cvlan_tag:1;
+	/* Explicit Congestion Notification derived from
+	 * Traffic Class/TOS field of IPv6/v4
+	 */
+	u32 ip_ecn:2;
+	/* Differentiated Services Code Point derived from
+	 * Traffic Class/TOS field of IPv6/v4
+	 */
+	u32 ip_dscp:6;
+	u32 ip_protocol:8;	/* IP protocol */
+	/* TCP destination port.
+	 * tcp and udp sport/dport are mutually exclusive
+	 */
+	u32 tcp_dport:16;
+	/* TCP source port. tcp and udp sport/dport are mutually exclusive */
+	u32 tcp_sport:16;
+	u32 ttl_hoplimit:8;
+	u32 reserved:24;
+	/* UDP destination port. tcp and udp sport/dport are mutually exclusive */
+	u32 udp_dport:16;
+	/* UDP source port. tcp and udp sport/dport are mutually exclusive */
+	u32 udp_sport:16;
+	/* IPv6 source address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 src_ip_127_96;
+	/* IPv6 source address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 src_ip_95_64;
+	/* IPv6 source address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 src_ip_63_32;
+	/* IPv6 source address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 src_ip_31_0;
+	/* IPv6 destination address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 dst_ip_127_96;
+	/* IPv6 destination address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 dst_ip_95_64;
+	/* IPv6 destination address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 dst_ip_63_32;
+	/* IPv6 destination address of incoming packets
+	 * For IPv4 address use bits 31:0 (rest of the bits are reserved)
+	 * This field should be qualified by an appropriate ethertype
+	 */
+	u32 dst_ip_31_0;
+};
+
+struct mlx5dr_match_misc {
+	u32 source_sqn:24;		/* Source SQN */
+	u32 source_vhca_port:4;
+	/* used with GRE, sequence number exists when gre_s_present == 1 */
+	u32 gre_s_present:1;
+	/* used with GRE, key exists when gre_k_present == 1 */
+	u32 gre_k_present:1;
+	u32 reserved_auto1:1;
+	/* used with GRE, checksum exists when gre_c_present == 1 */
+	u32 gre_c_present:1;
+	/* Source port. 0xffff determines wire port */
+	u32 source_port:16;
+	u32 reserved_auto2:16;
+	/* VLAN ID of second VLAN tag in the inner header of the incoming packet.
+	 * Valid only when inner_second_cvlan_tag ==1 or inner_second_svlan_tag ==1
+	 */
+	u32 inner_second_vid:12;
+	/* CFI bit of second VLAN tag in the inner header of the incoming packet.
+	 * Valid only when inner_second_cvlan_tag ==1 or inner_second_svlan_tag ==1
+	 */
+	u32 inner_second_cfi:1;
+	/* Priority of second VLAN tag in the inner header of the incoming packet.
+	 * Valid only when inner_second_cvlan_tag ==1 or inner_second_svlan_tag ==1
+	 */
+	u32 inner_second_prio:3;
+	/* VLAN ID of second VLAN tag in the outer header of the incoming packet.
+	 * Valid only when outer_second_cvlan_tag ==1 or outer_second_svlan_tag ==1
+	 */
+	u32 outer_second_vid:12;
+	/* CFI bit of second VLAN tag in the outer header of the incoming packet.
+	 * Valid only when outer_second_cvlan_tag ==1 or outer_second_svlan_tag ==1
+	 */
+	u32 outer_second_cfi:1;
+	/* Priority of second VLAN tag in the outer header of the incoming packet.
+	 * Valid only when outer_second_cvlan_tag ==1 or outer_second_svlan_tag ==1
+	 */
+	u32 outer_second_prio:3;
+	u32 gre_protocol:16;		/* GRE Protocol (outer) */
+	u32 reserved_auto3:12;
+	/* The second vlan in the inner header of the packet is s-vlan (0x88a8).
+	 * inner_second_cvlan_tag and inner_second_svlan_tag cannot be set together
+	 */
+	u32 inner_second_svlan_tag:1;
+	/* The second vlan in the outer header of the packet is s-vlan (0x88a8).
+	 * outer_second_cvlan_tag and outer_second_svlan_tag cannot be set together
+	 */
+	u32 outer_second_svlan_tag:1;
+	/* The second vlan in the inner header of the packet is c-vlan (0x8100).
+	 * inner_second_cvlan_tag and inner_second_svlan_tag cannot be set together
+	 */
+	u32 inner_second_cvlan_tag:1;
+	/* The second vlan in the outer header of the packet is c-vlan (0x8100).
+	 * outer_second_cvlan_tag and outer_second_svlan_tag cannot be set together
+	 */
+	u32 outer_second_cvlan_tag:1;
+	u32 gre_key_l:8;		/* GRE Key [7:0] (outer) */
+	u32 gre_key_h:24;		/* GRE Key[31:8] (outer) */
+	u32 reserved_auto4:8;
+	u32 vxlan_vni:24;		/* VXLAN VNI (outer) */
+	u32 geneve_oam:1;		/* GENEVE OAM field (outer) */
+	u32 reserved_auto5:7;
+	u32 geneve_vni:24;		/* GENEVE VNI field (outer) */
+	u32 outer_ipv6_flow_label:20;	/* Flow label of incoming IPv6 packet (outer) */
+	u32 reserved_auto6:12;
+	u32 inner_ipv6_flow_label:20;	/* Flow label of incoming IPv6 packet (inner) */
+	u32 reserved_auto7:12;
+	u32 geneve_protocol_type:16;	/* GENEVE protocol type (outer) */
+	u32 geneve_opt_len:6;		/* GENEVE OptLen (outer) */
+	u32 reserved_auto8:10;
+	u32 bth_dst_qp:24;		/* Destination QP in BTH header */
+	u32 reserved_auto9:8;
+	u8 reserved_auto10[20];
+};
+
+struct mlx5dr_match_misc2 {
+	u32 outer_first_mpls_ttl:8;		/* First MPLS TTL (outer) */
+	u32 outer_first_mpls_s_bos:1;		/* First MPLS S_BOS (outer) */
+	u32 outer_first_mpls_exp:3;		/* First MPLS EXP (outer) */
+	u32 outer_first_mpls_label:20;		/* First MPLS LABEL (outer) */
+	u32 inner_first_mpls_ttl:8;		/* First MPLS TTL (inner) */
+	u32 inner_first_mpls_s_bos:1;		/* First MPLS S_BOS (inner) */
+	u32 inner_first_mpls_exp:3;		/* First MPLS EXP (inner) */
+	u32 inner_first_mpls_label:20;		/* First MPLS LABEL (inner) */
+	u32 outer_first_mpls_over_gre_ttl:8;	/* last MPLS TTL (outer) */
+	u32 outer_first_mpls_over_gre_s_bos:1;	/* last MPLS S_BOS (outer) */
+	u32 outer_first_mpls_over_gre_exp:3;	/* last MPLS EXP (outer) */
+	u32 outer_first_mpls_over_gre_label:20;	/* last MPLS LABEL (outer) */
+	u32 outer_first_mpls_over_udp_ttl:8;	/* last MPLS TTL (outer) */
+	u32 outer_first_mpls_over_udp_s_bos:1;	/* last MPLS S_BOS (outer) */
+	u32 outer_first_mpls_over_udp_exp:3;	/* last MPLS EXP (outer) */
+	u32 outer_first_mpls_over_udp_label:20;	/* last MPLS LABEL (outer) */
+	u32 metadata_reg_c_7;			/* metadata_reg_c_7 */
+	u32 metadata_reg_c_6;			/* metadata_reg_c_6 */
+	u32 metadata_reg_c_5;			/* metadata_reg_c_5 */
+	u32 metadata_reg_c_4;			/* metadata_reg_c_4 */
+	u32 metadata_reg_c_3;			/* metadata_reg_c_3 */
+	u32 metadata_reg_c_2;			/* metadata_reg_c_2 */
+	u32 metadata_reg_c_1;			/* metadata_reg_c_1 */
+	u32 metadata_reg_c_0;			/* metadata_reg_c_0 */
+	u32 metadata_reg_a;			/* metadata_reg_a */
+	u32 metadata_reg_b;			/* metadata_reg_b */
+	u8 reserved_auto2[8];
+};
+
+struct mlx5dr_match_misc3 {
+	u32 inner_tcp_seq_num;
+	u32 outer_tcp_seq_num;
+	u32 inner_tcp_ack_num;
+	u32 outer_tcp_ack_num;
+	u32 outer_vxlan_gpe_vni:24;
+	u32 reserved_auto1:8;
+	u32 reserved_auto2:16;
+	u32 outer_vxlan_gpe_flags:8;
+	u32 outer_vxlan_gpe_next_protocol:8;
+	u32 icmpv4_header_data;
+	u32 icmpv6_header_data;
+	u32 icmpv6_code:8;
+	u32 icmpv6_type:8;
+	u32 icmpv4_code:8;
+	u32 icmpv4_type:8;
+	u8 reserved_auto3[0x1c];
+};
+
+struct mlx5dr_match_param {
+	struct mlx5dr_match_spec outer;
+	struct mlx5dr_match_misc misc;
+	struct mlx5dr_match_spec inner;
+	struct mlx5dr_match_misc2 misc2;
+	struct mlx5dr_match_misc3 misc3;
+};
+
+#define DR_MASK_IS_FLEX_PARSER_ICMPV4_SET(_misc3) ((_misc3)->icmpv4_type || \
+						   (_misc3)->icmpv4_code || \
+						   (_misc3)->icmpv4_header_data)
+
+struct mlx5dr_esw_caps {
+	u64 drop_icm_address_rx;
+	u64 drop_icm_address_tx;
+	u64 uplink_icm_address_rx;
+	u64 uplink_icm_address_tx;
+	bool sw_owner;
+};
+
+struct mlx5dr_cmd_vport_cap {
+	u16 vport_gvmi;
+	u16 vhca_gvmi;
+	u64 icm_address_rx;
+	u64 icm_address_tx;
+	u32 num;
+};
+
+struct mlx5dr_cmd_caps {
+	u16 gvmi;
+	u64 nic_rx_drop_address;
+	u64 nic_tx_drop_address;
+	u64 nic_tx_allow_address;
+	u64 esw_rx_drop_address;
+	u64 esw_tx_drop_address;
+	u32 log_icm_size;
+	u64 hdr_modify_icm_addr;
+	u32 flex_protocols;
+	u8 flex_parser_id_icmp_dw0;
+	u8 flex_parser_id_icmp_dw1;
+	u8 flex_parser_id_icmpv6_dw0;
+	u8 flex_parser_id_icmpv6_dw1;
+	u8 max_ft_level;
+	u16 roce_min_src_udp;
+	u8 num_esw_ports;
+	bool eswitch_manager;
+	bool rx_sw_owner;
+	bool tx_sw_owner;
+	bool fdb_sw_owner;
+	u32 num_vports;
+	struct mlx5dr_esw_caps esw_caps;
+	struct mlx5dr_cmd_vport_cap *vports_caps;
+	bool prio_tag_required;
+};
+
+struct mlx5dr_domain_rx_tx {
+	u64 drop_icm_addr;
+	u64 default_icm_addr;
+	enum mlx5dr_ste_entry_type ste_type;
+};
+
+struct mlx5dr_domain_info {
+	bool supp_sw_steering;
+	u32 max_inline_size;
+	u32 max_send_wr;
+	u32 max_log_sw_icm_sz;
+	u32 max_log_action_icm_sz;
+	struct mlx5dr_domain_rx_tx rx;
+	struct mlx5dr_domain_rx_tx tx;
+	struct mlx5dr_cmd_caps caps;
+};
+
+struct mlx5dr_domain_cache {
+	struct mlx5dr_fw_recalc_cs_ft **recalc_cs_ft;
+};
+
+struct mlx5dr_domain {
+	struct mlx5dr_domain *peer_dmn;
+	struct mlx5_core_dev *mdev;
+	u32 pdn;
+	struct mlx5_uars_page *uar;
+	enum mlx5dr_domain_type type;
+	refcount_t refcount;
+	struct mutex mutex; /* protect domain */
+	struct mlx5dr_icm_pool *ste_icm_pool;
+	struct mlx5dr_icm_pool *action_icm_pool;
+	struct mlx5dr_send_ring *send_ring;
+	struct mlx5dr_domain_info info;
+	struct mlx5dr_domain_cache cache;
+};
+
+struct mlx5dr_table_rx_tx {
+	struct mlx5dr_ste_htbl *s_anchor;
+	struct mlx5dr_domain_rx_tx *nic_dmn;
+	u64 default_icm_addr;
+};
+
+struct mlx5dr_table {
+	struct mlx5dr_domain *dmn;
+	struct mlx5dr_table_rx_tx rx;
+	struct mlx5dr_table_rx_tx tx;
+	u32 level;
+	u32 table_type;
+	u32 table_id;
+	struct list_head matcher_list;
+	struct mlx5dr_action *miss_action;
+	refcount_t refcount;
+};
+
+struct mlx5dr_matcher_rx_tx {
+	struct mlx5dr_ste_htbl *s_htbl;
+	struct mlx5dr_ste_htbl *e_anchor;
+	struct mlx5dr_ste_build *ste_builder;
+	struct mlx5dr_ste_build ste_builder4[DR_RULE_MAX_STES];
+	struct mlx5dr_ste_build ste_builder6[DR_RULE_MAX_STES];
+	u8 num_of_builders;
+	u8 num_of_builders4;
+	u8 num_of_builders6;
+	u64 default_icm_addr;
+	struct mlx5dr_table_rx_tx *nic_tbl;
+};
+
+struct mlx5dr_matcher {
+	struct mlx5dr_table *tbl;
+	struct mlx5dr_matcher_rx_tx rx;
+	struct mlx5dr_matcher_rx_tx tx;
+	struct list_head matcher_list;
+	u16 prio;
+	struct mlx5dr_match_param mask;
+	u8 match_criteria;
+	refcount_t refcount;
+	struct mlx5dv_flow_matcher *dv_matcher;
+};
+
+struct mlx5dr_rule_member {
+	struct mlx5dr_ste *ste;
+	/* attached to mlx5dr_rule via this */
+	struct list_head list;
+	/* attached to mlx5dr_ste via this */
+	struct list_head use_ste_list;
+};
+
+struct mlx5dr_action {
+	enum mlx5dr_action_type action_type;
+	refcount_t refcount;
+	union {
+		struct {
+			struct mlx5dr_domain *dmn;
+			struct mlx5dr_icm_chunk *chunk;
+			u8 *data;
+			u32 data_size;
+			u16 num_of_actions;
+			u32 index;
+			u8 allow_rx:1;
+			u8 allow_tx:1;
+			u8 modify_ttl:1;
+		} rewrite;
+		struct {
+			struct mlx5dr_domain *dmn;
+			u32 reformat_id;
+			u32 reformat_size;
+		} reformat;
+		struct {
+			u8 is_fw_tbl:1;
+			union {
+				struct mlx5dr_table *tbl;
+				struct {
+					struct mlx5_flow_table *ft;
+					u64 rx_icm_addr;
+					u64 tx_icm_addr;
+					struct mlx5_core_dev *mdev;
+				} fw_tbl;
+			};
+		} dest_tbl;
+		struct {
+			u32 ctr_id;
+			u32 offeset;
+		} ctr;
+		struct {
+			struct mlx5dr_domain *dmn;
+			struct mlx5dr_cmd_vport_cap *caps;
+			u32 num;
+		} vport;
+		struct {
+			u32 vlan_hdr; /* tpid_pcp_dei_vid */
+		} push_vlan;
+		u32 flow_tag;
+	};
+};
+
+enum mlx5dr_connect_type {
+	CONNECT_HIT	= 1,
+	CONNECT_MISS	= 2,
+};
+
+struct mlx5dr_htbl_connect_info {
+	enum mlx5dr_connect_type type;
+	union {
+		struct mlx5dr_ste_htbl *hit_next_htbl;
+		u64 miss_icm_addr;
+	};
+};
+
+struct mlx5dr_rule_rx_tx {
+	struct list_head rule_members_list;
+	struct mlx5dr_matcher_rx_tx *nic_matcher;
+};
+
+struct mlx5dr_rule {
+	struct mlx5dr_matcher *matcher;
+	struct mlx5dr_rule_rx_tx rx;
+	struct mlx5dr_rule_rx_tx tx;
+	struct list_head rule_actions_list;
+};
+
+void mlx5dr_rule_update_rule_member(struct mlx5dr_ste *new_ste,
+				    struct mlx5dr_ste *ste);
+
+struct mlx5dr_icm_chunk {
+	struct mlx5dr_icm_bucket *bucket;
+	struct list_head chunk_list;
+	u32 rkey;
+	u32 num_of_entries;
+	u32 byte_size;
+	u64 icm_addr;
+	u64 mr_addr;
+
+	/* Memory optimisation */
+	struct mlx5dr_ste *ste_arr;
+	u8 *hw_ste_arr;
+	struct list_head *miss_list;
+};
+
+static inline int
+mlx5dr_matcher_supp_flex_parser_icmp_v4(struct mlx5dr_cmd_caps *caps)
+{
+	return caps->flex_protocols & MLX5_FLEX_PARSER_ICMP_V4_ENABLED;
+}
+
+static inline int
+mlx5dr_matcher_supp_flex_parser_icmp_v6(struct mlx5dr_cmd_caps *caps)
+{
+	return caps->flex_protocols & MLX5_FLEX_PARSER_ICMP_V6_ENABLED;
+}
+
+int mlx5dr_matcher_select_builders(struct mlx5dr_matcher *matcher,
+				   struct mlx5dr_matcher_rx_tx *nic_matcher,
+				   bool ipv6);
+
+static inline u32
+mlx5dr_icm_pool_chunk_size_to_entries(enum mlx5dr_icm_chunk_size chunk_size)
+{
+	return 1 << chunk_size;
+}
+
+static inline int
+mlx5dr_icm_pool_chunk_size_to_byte(enum mlx5dr_icm_chunk_size chunk_size,
+				   enum mlx5dr_icm_type icm_type)
+{
+	int num_of_entries;
+	int entry_size;
+
+	if (icm_type == DR_ICM_TYPE_STE)
+		entry_size = DR_STE_SIZE;
+	else
+		entry_size = DR_MODIFY_ACTION_SIZE;
+
+	num_of_entries = mlx5dr_icm_pool_chunk_size_to_entries(chunk_size);
+
+	return entry_size * num_of_entries;
+}
+
+static inline struct mlx5dr_cmd_vport_cap *
+mlx5dr_get_vport_cap(struct mlx5dr_cmd_caps *caps, u32 vport)
+{
+	if (!caps->vports_caps ||
+	    (vport >= caps->num_vports && vport != WIRE_PORT))
+		return NULL;
+
+	if (vport == WIRE_PORT)
+		vport = caps->num_vports;
+
+	return &caps->vports_caps[vport];
+}
+
+struct mlx5dr_cmd_query_flow_table_details {
+	u8 status;
+	u8 level;
+	u64 sw_owner_icm_root_1;
+	u64 sw_owner_icm_root_0;
+};
+
+/* internal API functions */
+int mlx5dr_cmd_query_device(struct mlx5_core_dev *mdev,
+			    struct mlx5dr_cmd_caps *caps);
+int mlx5dr_cmd_query_esw_vport_context(struct mlx5_core_dev *mdev,
+				       bool other_vport, u16 vport_number,
+				       u64 *icm_address_rx,
+				       u64 *icm_address_tx);
+int mlx5dr_cmd_query_gvmi(struct mlx5_core_dev *mdev,
+			  bool other_vport, u16 vport_number, u16 *gvmi);
+int mlx5dr_cmd_query_esw_caps(struct mlx5_core_dev *mdev,
+			      struct mlx5dr_esw_caps *caps);
+int mlx5dr_cmd_sync_steering(struct mlx5_core_dev *mdev);
+int mlx5dr_cmd_set_fte_modify_and_vport(struct mlx5_core_dev *mdev,
+					u32 table_type,
+					u32 table_id,
+					u32 group_id,
+					u32 modify_header_id,
+					u32 vport_id);
+int mlx5dr_cmd_del_flow_table_entry(struct mlx5_core_dev *mdev,
+				    u32 table_type,
+				    u32 table_id);
+int mlx5dr_cmd_alloc_modify_header(struct mlx5_core_dev *mdev,
+				   u32 table_type,
+				   u8 num_of_actions,
+				   u64 *actions,
+				   u32 *modify_header_id);
+int mlx5dr_cmd_dealloc_modify_header(struct mlx5_core_dev *mdev,
+				     u32 modify_header_id);
+int mlx5dr_cmd_create_empty_flow_group(struct mlx5_core_dev *mdev,
+				       u32 table_type,
+				       u32 table_id,
+				       u32 *group_id);
+int mlx5dr_cmd_destroy_flow_group(struct mlx5_core_dev *mdev,
+				  u32 table_type,
+				  u32 table_id,
+				  u32 group_id);
+int mlx5dr_cmd_create_flow_table(struct mlx5_core_dev *mdev,
+				 u32 table_type,
+				 u64 icm_addr_rx,
+				 u64 icm_addr_tx,
+				 u8 level,
+				 bool sw_owner,
+				 bool term_tbl,
+				 u64 *fdb_rx_icm_addr,
+				 u32 *table_id);
+int mlx5dr_cmd_destroy_flow_table(struct mlx5_core_dev *mdev,
+				  u32 table_id,
+				  u32 table_type);
+int mlx5dr_cmd_query_flow_table(struct mlx5_core_dev *dev,
+				enum fs_flow_table_type type,
+				u32 table_id,
+				struct mlx5dr_cmd_query_flow_table_details *output);
+int mlx5dr_cmd_create_reformat_ctx(struct mlx5_core_dev *mdev,
+				   enum mlx5_reformat_ctx_type rt,
+				   size_t reformat_size,
+				   void *reformat_data,
+				   u32 *reformat_id);
+void mlx5dr_cmd_destroy_reformat_ctx(struct mlx5_core_dev *mdev,
+				     u32 reformat_id);
+
+struct mlx5dr_cmd_gid_attr {
+	u8 gid[16];
+	u8 mac[6];
+	u32 roce_ver;
+};
+
+struct mlx5dr_cmd_qp_create_attr {
+	u32 page_id;
+	u32 pdn;
+	u32 cqn;
+	u32 pm_state;
+	u32 service_type;
+	u32 buff_umem_id;
+	u32 db_umem_id;
+	u32 sq_wqe_cnt;
+	u32 rq_wqe_cnt;
+	u32 rq_wqe_shift;
+};
+
+int mlx5dr_cmd_query_gid(struct mlx5_core_dev *mdev, u8 vhca_port_num,
+			 u16 index, struct mlx5dr_cmd_gid_attr *attr);
+
+struct mlx5dr_icm_pool *mlx5dr_icm_pool_create(struct mlx5dr_domain *dmn,
+					       enum mlx5dr_icm_type icm_type);
+void mlx5dr_icm_pool_destroy(struct mlx5dr_icm_pool *pool);
+
+struct mlx5dr_icm_chunk *
+mlx5dr_icm_alloc_chunk(struct mlx5dr_icm_pool *pool,
+		       enum mlx5dr_icm_chunk_size chunk_size);
+void mlx5dr_icm_free_chunk(struct mlx5dr_icm_chunk *chunk);
+bool mlx5dr_ste_is_not_valid_entry(u8 *p_hw_ste);
+int mlx5dr_ste_htbl_init_and_postsend(struct mlx5dr_domain *dmn,
+				      struct mlx5dr_domain_rx_tx *nic_dmn,
+				      struct mlx5dr_ste_htbl *htbl,
+				      struct mlx5dr_htbl_connect_info *connect_info,
+				      bool update_hw_ste);
+void mlx5dr_ste_set_formatted_ste(u16 gvmi,
+				  struct mlx5dr_domain_rx_tx *nic_dmn,
+				  struct mlx5dr_ste_htbl *htbl,
+				  u8 *formatted_ste,
+				  struct mlx5dr_htbl_connect_info *connect_info);
+void mlx5dr_ste_copy_param(u8 match_criteria,
+			   struct mlx5dr_match_param *set_param,
+			   struct mlx5dr_match_parameters *mask);
+
+void mlx5dr_crc32_init_table(void);
+u32 mlx5dr_crc32_slice8_calc(const void *input_data, size_t length);
+
+struct mlx5dr_qp {
+	struct mlx5_core_dev *mdev;
+	struct mlx5_wq_qp wq;
+	struct mlx5_uars_page *uar;
+	struct mlx5_wq_ctrl wq_ctrl;
+	struct mlx5_core_qp mqp;
+	struct {
+		unsigned int pc;
+		unsigned int cc;
+		unsigned int size;
+		unsigned int *wqe_head;
+		unsigned int wqe_cnt;
+	} sq;
+	struct {
+		unsigned int pc;
+		unsigned int cc;
+		unsigned int size;
+		unsigned int wqe_cnt;
+	} rq;
+	int max_inline_data;
+};
+
+struct mlx5dr_cq {
+	struct mlx5_core_dev *mdev;
+	struct mlx5_cqwq wq;
+	struct mlx5_wq_ctrl wq_ctrl;
+	struct mlx5_core_cq mcq;
+	struct mlx5dr_qp *qp;
+};
+
+struct mlx5dr_mr {
+	struct mlx5_core_dev *mdev;
+	struct mlx5_core_mkey mkey;
+	dma_addr_t dma_addr;
+	void *addr;
+	size_t size;
+};
+
+#define MAX_SEND_CQE		64
+#define MIN_READ_SYNC		64
+
+struct mlx5dr_send_ring {
+	struct mlx5dr_cq *cq;
+	struct mlx5dr_qp *qp;
+	struct mlx5dr_mr *mr;
+	/* How many wqes are waiting for completion */
+	u32 pending_wqe;
+	/* Signal request per this threshold value */
+	u16 signal_th;
+	/* Each post_send_size is less than max_post_send_size */
+	u32 max_post_send_size;
+	/* manage the send queue */
+	u32 tx_head;
+	void *buf;
+	u32 buf_size;
+	struct ib_wc wc[MAX_SEND_CQE];
+	u8 sync_buff[MIN_READ_SYNC];
+	struct mlx5dr_mr *sync_mr;
+};
+
+int mlx5dr_send_ring_alloc(struct mlx5dr_domain *dmn);
+void mlx5dr_send_ring_free(struct mlx5dr_domain *dmn,
+			   struct mlx5dr_send_ring *send_ring);
+int mlx5dr_send_ring_force_drain(struct mlx5dr_domain *dmn);
+int mlx5dr_send_postsend_ste(struct mlx5dr_domain *dmn,
+			     struct mlx5dr_ste *ste,
+			     u8 *data,
+			     u16 size,
+			     u16 offset);
+int mlx5dr_send_postsend_htbl(struct mlx5dr_domain *dmn,
+			      struct mlx5dr_ste_htbl *htbl,
+			      u8 *formatted_ste, u8 *mask);
+int mlx5dr_send_postsend_formatted_htbl(struct mlx5dr_domain *dmn,
+					struct mlx5dr_ste_htbl *htbl,
+					u8 *ste_init_data,
+					bool update_hw_ste);
+int mlx5dr_send_postsend_action(struct mlx5dr_domain *dmn,
+				struct mlx5dr_action *action);
+
+struct mlx5dr_fw_recalc_cs_ft {
+	u64 rx_icm_addr;
+	u32 table_id;
+	u32 group_id;
+	u32 modify_hdr_id;
+};
+
+struct mlx5dr_fw_recalc_cs_ft *
+mlx5dr_fw_create_recalc_cs_ft(struct mlx5dr_domain *dmn, u32 vport_num);
+void mlx5dr_fw_destroy_recalc_cs_ft(struct mlx5dr_domain *dmn,
+				    struct mlx5dr_fw_recalc_cs_ft *recalc_cs_ft);
+int mlx5dr_domain_cache_get_recalc_cs_ft_addr(struct mlx5dr_domain *dmn,
+					      u32 vport_num,
+					      u64 *rx_icm_addr);
+#endif  /* _DR_TYPES_H_ */
-- 
2.21.0



* [net-next 05/18] net/mlx5: DR, Expose an internal API to issue RDMA operations
From: Saeed Mahameed @ 2019-09-02  7:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Alex Vesker, Erez Shitrit, Mark Bloch,
	Saeed Mahameed
In-Reply-To: <20190902072213.7683-1-saeedm@mellanox.com>

From: Alex Vesker <valex@mellanox.com>

Inserting or deleting a rule is done by RDMA read/write operations to SW
ICM device memory. This file provides support for executing these
operations. It includes allocating the needed resources and providing an
API for writing steering entries to the memory.
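
As a rough illustration of the completion accounting used when posting
(a user-space sketch, not the driver code): each posted request is an RDMA
write followed by an RDMA read, a completion is requested only once every
signal_th work requests, and each polled completion retires signal_th
pending requests. The names ring_model and model_*, and the threshold
value 8 used in the note below, are hypothetical and chosen for this
example. The modeled hardware completes every signaled request at once,
and the model stops polling on an empty CQ where the driver keeps polling.

```c
#include <assert.h>
#include <stdbool.h>

#define TH_NUMS_TO_DRAIN 2	/* same value as in the patch */

/* Hypothetical user-space model of the send ring; not the driver struct. */
struct ring_model {
	unsigned int pending_wqe;	/* posted WQEs not yet retired */
	unsigned int signal_th;		/* request a CQE every signal_th WQEs */
	unsigned int cqes_ready;	/* completions waiting in the modeled CQ */
	unsigned int signaled;		/* total WQEs posted with a CQE request */
};

/* Models polling one CQE: returns 1 if a completion was consumed. */
static int model_poll_one(struct ring_model *r)
{
	if (!r->cqes_ready)
		return 0;
	r->cqes_ready--;
	return 1;
}

/* Models dr_handle_pending_wc(); stops on an empty CQ instead of spinning. */
static void model_handle_pending(struct ring_model *r)
{
	bool is_drain;

	if (r->pending_wqe < r->signal_th)
		return;

	/* The queue is considered full: drain until nothing is pending */
	is_drain = r->pending_wqe >= r->signal_th * TH_NUMS_TO_DRAIN;
	do {
		if (!model_poll_one(r))
			break;
		r->pending_wqe -= r->signal_th;
	} while (is_drain && r->pending_wqe);
}

/* Counts one WQE; the modeled HW completes a signaled WQE immediately. */
static void model_count_wqe(struct ring_model *r)
{
	r->pending_wqe++;
	if (r->pending_wqe % r->signal_th == 0) {
		r->signaled++;
		r->cqes_ready++;
	}
}

/* Models one post: handle pending completions, then write + read. */
static void model_post(struct ring_model *r)
{
	model_handle_pending(r);
	model_count_wqe(r);	/* RDMA write */
	model_count_wqe(r);	/* RDMA read */
}

/* Posts 'posts' write/read pairs and returns the final ring state. */
struct ring_model model_run(unsigned int posts, unsigned int signal_th)
{
	struct ring_model r = { .signal_th = signal_th };
	unsigned int i;

	assert(signal_th > 0);	/* the modulo above needs a non-zero threshold */
	for (i = 0; i < posts; i++)
		model_post(&r);
	return r;
}
```

With an assumed threshold of 8, model_run(100, 8) posts 200 work requests
and signals one in every eight of them, so the CQ sees far fewer
completions than the SQ sees work requests.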

Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/steering/dr_send.c     | 976 ++++++++++++++++++
 1 file changed, 976 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
new file mode 100644
index 000000000000..ef0dea44f3b3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
@@ -0,0 +1,976 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2019 Mellanox Technologies. */
+
+#include "dr_types.h"
+
+#define QUEUE_SIZE 128
+#define SIGNAL_PER_DIV_QUEUE 16
+#define TH_NUMS_TO_DRAIN 2
+
+enum { CQ_OK = 0, CQ_EMPTY = -1, CQ_POLL_ERR = -2 };
+
+struct dr_data_seg {
+	u64 addr;
+	u32 length;
+	u32 lkey;
+	unsigned int send_flags;
+};
+
+struct postsend_info {
+	struct dr_data_seg write;
+	struct dr_data_seg read;
+	u64 remote_addr;
+	u32 rkey;
+};
+
+struct dr_qp_rtr_attr {
+	struct mlx5dr_cmd_gid_attr dgid_attr;
+	enum ib_mtu mtu;
+	u32 qp_num;
+	u16 port_num;
+	u8 min_rnr_timer;
+	u8 sgid_index;
+	u16 udp_src_port;
+};
+
+struct dr_qp_rts_attr {
+	u8 timeout;
+	u8 retry_cnt;
+	u8 rnr_retry;
+};
+
+struct dr_qp_init_attr {
+	u32 cqn;
+	u32 pdn;
+	u32 max_send_wr;
+	struct mlx5_uars_page *uar;
+};
+
+static int dr_parse_cqe(struct mlx5dr_cq *dr_cq, struct mlx5_cqe64 *cqe64)
+{
+	unsigned int idx;
+	u8 opcode;
+
+	opcode = get_cqe_opcode(cqe64);
+	if (opcode == MLX5_CQE_REQ_ERR) {
+		idx = be16_to_cpu(cqe64->wqe_counter) &
+			(dr_cq->qp->sq.wqe_cnt - 1);
+		dr_cq->qp->sq.cc = dr_cq->qp->sq.wqe_head[idx] + 1;
+	} else if (opcode == MLX5_CQE_RESP_ERR) {
+		++dr_cq->qp->sq.cc;
+	} else {
+		idx = be16_to_cpu(cqe64->wqe_counter) &
+			(dr_cq->qp->sq.wqe_cnt - 1);
+		dr_cq->qp->sq.cc = dr_cq->qp->sq.wqe_head[idx] + 1;
+
+		return CQ_OK;
+	}
+
+	return CQ_POLL_ERR;
+}
+
+static int dr_cq_poll_one(struct mlx5dr_cq *dr_cq)
+{
+	struct mlx5_cqe64 *cqe64;
+	int err;
+
+	cqe64 = mlx5_cqwq_get_cqe(&dr_cq->wq);
+	if (!cqe64)
+		return CQ_EMPTY;
+
+	mlx5_cqwq_pop(&dr_cq->wq);
+	err = dr_parse_cqe(dr_cq, cqe64);
+	mlx5_cqwq_update_db_record(&dr_cq->wq);
+
+	return err;
+}
+
+static int dr_poll_cq(struct mlx5dr_cq *dr_cq, int ne)
+{
+	int npolled;
+	int err = 0;
+
+	for (npolled = 0; npolled < ne; ++npolled) {
+		err = dr_cq_poll_one(dr_cq);
+		if (err != CQ_OK)
+			break;
+	}
+
+	return err == CQ_POLL_ERR ? err : npolled;
+}
+
+static void dr_qp_event(struct mlx5_core_qp *mqp, int event)
+{
+	pr_info("DR QP event %u on QP #%u\n", event, mqp->qpn);
+}
+
+static struct mlx5dr_qp *dr_create_rc_qp(struct mlx5_core_dev *mdev,
+					 struct dr_qp_init_attr *attr)
+{
+	u32 temp_qpc[MLX5_ST_SZ_DW(qpc)] = {};
+	struct mlx5_wq_param wqp;
+	struct mlx5dr_qp *dr_qp;
+	int inlen;
+	void *qpc;
+	void *in;
+	int err;
+
+	dr_qp = kzalloc(sizeof(*dr_qp), GFP_KERNEL);
+	if (!dr_qp)
+		return NULL;
+
+	wqp.buf_numa_node = mdev->priv.numa_node;
+	wqp.db_numa_node = mdev->priv.numa_node;
+
+	dr_qp->rq.pc = 0;
+	dr_qp->rq.cc = 0;
+	dr_qp->rq.wqe_cnt = 4;
+	dr_qp->sq.pc = 0;
+	dr_qp->sq.cc = 0;
+	dr_qp->sq.wqe_cnt = roundup_pow_of_two(attr->max_send_wr);
+
+	MLX5_SET(qpc, temp_qpc, log_rq_stride, ilog2(MLX5_SEND_WQE_DS) - 4);
+	MLX5_SET(qpc, temp_qpc, log_rq_size, ilog2(dr_qp->rq.wqe_cnt));
+	MLX5_SET(qpc, temp_qpc, log_sq_size, ilog2(dr_qp->sq.wqe_cnt));
+	err = mlx5_wq_qp_create(mdev, &wqp, temp_qpc, &dr_qp->wq,
+				&dr_qp->wq_ctrl);
+	if (err) {
+		mlx5_core_info(mdev, "Can't create QP WQ\n");
+		goto err_wq;
+	}
+
+	dr_qp->sq.wqe_head = kcalloc(dr_qp->sq.wqe_cnt,
+				     sizeof(dr_qp->sq.wqe_head[0]),
+				     GFP_KERNEL);
+
+	if (!dr_qp->sq.wqe_head) {
+		mlx5_core_warn(mdev, "Can't allocate wqe head\n");
+		goto err_wqe_head;
+	}
+
+	inlen = MLX5_ST_SZ_BYTES(create_qp_in) +
+		MLX5_FLD_SZ_BYTES(create_qp_in, pas[0]) *
+		dr_qp->wq_ctrl.buf.npages;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in) {
+		err = -ENOMEM;
+		goto err_in;
+	}
+
+	qpc = MLX5_ADDR_OF(create_qp_in, in, qpc);
+	MLX5_SET(qpc, qpc, st, MLX5_QP_ST_RC);
+	MLX5_SET(qpc, qpc, pm_state, MLX5_QP_PM_MIGRATED);
+	MLX5_SET(qpc, qpc, pd, attr->pdn);
+	MLX5_SET(qpc, qpc, uar_page, attr->uar->index);
+	MLX5_SET(qpc, qpc, log_page_size,
+		 dr_qp->wq_ctrl.buf.page_shift - MLX5_ADAPTER_PAGE_SHIFT);
+	MLX5_SET(qpc, qpc, fre, 1);
+	MLX5_SET(qpc, qpc, rlky, 1);
+	MLX5_SET(qpc, qpc, cqn_snd, attr->cqn);
+	MLX5_SET(qpc, qpc, cqn_rcv, attr->cqn);
+	MLX5_SET(qpc, qpc, log_rq_stride, ilog2(MLX5_SEND_WQE_DS) - 4);
+	MLX5_SET(qpc, qpc, log_rq_size, ilog2(dr_qp->rq.wqe_cnt));
+	MLX5_SET(qpc, qpc, rq_type, MLX5_NON_ZERO_RQ);
+	MLX5_SET(qpc, qpc, log_sq_size, ilog2(dr_qp->sq.wqe_cnt));
+	MLX5_SET64(qpc, qpc, dbr_addr, dr_qp->wq_ctrl.db.dma);
+	if (MLX5_CAP_GEN(mdev, cqe_version) == 1)
+		MLX5_SET(qpc, qpc, user_index, 0xFFFFFF);
+	mlx5_fill_page_frag_array(&dr_qp->wq_ctrl.buf,
+				  (__be64 *)MLX5_ADDR_OF(create_qp_in,
+							 in, pas));
+
+	err = mlx5_core_create_qp(mdev, &dr_qp->mqp, in, inlen);
+	kvfree(in);
+
+	if (err) {
+		mlx5_core_warn(mdev, "Can't create QP\n");
+		goto err_in;
+	}
+	dr_qp->mqp.event = dr_qp_event;
+	dr_qp->uar = attr->uar;
+
+	return dr_qp;
+
+err_in:
+	kfree(dr_qp->sq.wqe_head);
+err_wqe_head:
+	mlx5_wq_destroy(&dr_qp->wq_ctrl);
+err_wq:
+	kfree(dr_qp);
+	return NULL;
+}
+
+static void dr_destroy_qp(struct mlx5_core_dev *mdev,
+			  struct mlx5dr_qp *dr_qp)
+{
+	mlx5_core_destroy_qp(mdev, &dr_qp->mqp);
+	kfree(dr_qp->sq.wqe_head);
+	mlx5_wq_destroy(&dr_qp->wq_ctrl);
+	kfree(dr_qp);
+}
+
+static void dr_cmd_notify_hw(struct mlx5dr_qp *dr_qp, void *ctrl)
+{
+	dma_wmb();
+	*dr_qp->wq.sq.db = cpu_to_be32(dr_qp->sq.pc & 0xfffff);
+
+	/* After wmb() the hw is aware of new work */
+	wmb();
+
+	mlx5_write64(ctrl, dr_qp->uar->map + MLX5_BF_OFFSET);
+}
+
+static void dr_rdma_segments(struct mlx5dr_qp *dr_qp, u64 remote_addr,
+			     u32 rkey, struct dr_data_seg *data_seg,
+			     u32 opcode, int nreq)
+{
+	struct mlx5_wqe_raddr_seg *wq_raddr;
+	struct mlx5_wqe_ctrl_seg *wq_ctrl;
+	struct mlx5_wqe_data_seg *wq_dseg;
+	unsigned int size;
+	unsigned int idx;
+
+	size = sizeof(*wq_ctrl) / 16 + sizeof(*wq_dseg) / 16 +
+		sizeof(*wq_raddr) / 16;
+
+	idx = dr_qp->sq.pc & (dr_qp->sq.wqe_cnt - 1);
+
+	wq_ctrl = mlx5_wq_cyc_get_wqe(&dr_qp->wq.sq, idx);
+	wq_ctrl->imm = 0;
+	wq_ctrl->fm_ce_se = (data_seg->send_flags) ?
+		MLX5_WQE_CTRL_CQ_UPDATE : 0;
+	wq_ctrl->opmod_idx_opcode = cpu_to_be32(((dr_qp->sq.pc & 0xffff) << 8) |
+						opcode);
+	wq_ctrl->qpn_ds = cpu_to_be32(size | dr_qp->mqp.qpn << 8);
+	wq_raddr = (void *)(wq_ctrl + 1);
+	wq_raddr->raddr = cpu_to_be64(remote_addr);
+	wq_raddr->rkey = cpu_to_be32(rkey);
+	wq_raddr->reserved = 0;
+
+	wq_dseg = (void *)(wq_raddr + 1);
+	wq_dseg->byte_count = cpu_to_be32(data_seg->length);
+	wq_dseg->lkey = cpu_to_be32(data_seg->lkey);
+	wq_dseg->addr = cpu_to_be64(data_seg->addr);
+
+	dr_qp->sq.wqe_head[idx] = dr_qp->sq.pc++;
+
+	if (nreq)
+		dr_cmd_notify_hw(dr_qp, wq_ctrl);
+}
+
+static void dr_post_send(struct mlx5dr_qp *dr_qp, struct postsend_info *send_info)
+{
+	dr_rdma_segments(dr_qp, send_info->remote_addr, send_info->rkey,
+			 &send_info->write, MLX5_OPCODE_RDMA_WRITE, 0);
+	dr_rdma_segments(dr_qp, send_info->remote_addr, send_info->rkey,
+			 &send_info->read, MLX5_OPCODE_RDMA_READ, 1);
+}
+
+/**
+ * mlx5dr_send_fill_and_append_ste_send_info: Fill an STE send-info entry
+ * and append it to a send list.
+ *
+ *     @ste:       The ste this data is attached to
+ *     @size:      Size of the data to write
+ *     @offset:    Offset of the data from the start of the hw_ste entry
+ *     @data:      The data to write
+ *     @ste_info:  The send-info entry to fill and append to send_list
+ *     @send_list: The list to append to
+ *     @copy_data: If true, the data is copied into ste_info because it
+ *                 is not backed up anywhere else (as in re-hash).
+ *                 If false, only the pointer is stored, so the data can
+ *                 still be updated after it was added to the list.
+ */
+void mlx5dr_send_fill_and_append_ste_send_info(struct mlx5dr_ste *ste, u16 size,
+					       u16 offset, u8 *data,
+					       struct mlx5dr_ste_send_info *ste_info,
+					       struct list_head *send_list,
+					       bool copy_data)
+{
+	ste_info->size = size;
+	ste_info->ste = ste;
+	ste_info->offset = offset;
+
+	if (copy_data) {
+		memcpy(ste_info->data_cont, data, size);
+		ste_info->data = ste_info->data_cont;
+	} else {
+		ste_info->data = data;
+	}
+
+	list_add_tail(&ste_info->send_list, send_list);
+}
+
+/* The function tries to consume one wc each time, unless the queue is full.
+ * In that case, which means that the hw is behind the sw by a full queue
+ * length, the function will drain the cq until it is empty.
+ */
+static int dr_handle_pending_wc(struct mlx5dr_domain *dmn,
+				struct mlx5dr_send_ring *send_ring)
+{
+	bool is_drain = false;
+	int ne;
+
+	if (send_ring->pending_wqe < send_ring->signal_th)
+		return 0;
+
+	/* Queue is full, start draining it */
+	if (send_ring->pending_wqe >=
+	    dmn->send_ring->signal_th * TH_NUMS_TO_DRAIN)
+		is_drain = true;
+
+	do {
+		ne = dr_poll_cq(send_ring->cq, 1);
+		if (ne < 0)
+			return ne;
+		else if (ne == 1)
+			send_ring->pending_wqe -= send_ring->signal_th;
+	} while (is_drain && send_ring->pending_wqe);
+
+	return 0;
+}
+
+static void dr_fill_data_segs(struct mlx5dr_send_ring *send_ring,
+			      struct postsend_info *send_info)
+{
+	send_ring->pending_wqe++;
+
+	if (send_ring->pending_wqe % send_ring->signal_th == 0)
+		send_info->write.send_flags |= IB_SEND_SIGNALED;
+
+	send_ring->pending_wqe++;
+	send_info->read.length = send_info->write.length;
+	/* Read into the same write area */
+	send_info->read.addr = (uintptr_t)send_info->write.addr;
+	send_info->read.lkey = send_ring->mr->mkey.key;
+
+	if (send_ring->pending_wqe % send_ring->signal_th == 0)
+		send_info->read.send_flags = IB_SEND_SIGNALED;
+	else
+		send_info->read.send_flags = 0;
+}
+
+static int dr_postsend_icm_data(struct mlx5dr_domain *dmn,
+				struct postsend_info *send_info)
+{
+	struct mlx5dr_send_ring *send_ring = dmn->send_ring;
+	u32 buff_offset;
+	int ret;
+
+	ret = dr_handle_pending_wc(dmn, send_ring);
+	if (ret)
+		return ret;
+
+	if (send_info->write.length > dmn->info.max_inline_size) {
+		buff_offset = (send_ring->tx_head &
+			       (dmn->send_ring->signal_th - 1)) *
+			send_ring->max_post_send_size;
+		/* Copy to ring mr */
+		memcpy(send_ring->buf + buff_offset,
+		       (void *)(uintptr_t)send_info->write.addr,
+		       send_info->write.length);
+		send_info->write.addr = (uintptr_t)send_ring->mr->dma_addr + buff_offset;
+		send_info->write.lkey = send_ring->mr->mkey.key;
+	}
+
+	send_ring->tx_head++;
+	dr_fill_data_segs(send_ring, send_info);
+	dr_post_send(send_ring->qp, send_info);
+
+	return 0;
+}
+
+static int dr_get_tbl_copy_details(struct mlx5dr_domain *dmn,
+				   struct mlx5dr_ste_htbl *htbl,
+				   u8 **data,
+				   u32 *byte_size,
+				   int *iterations,
+				   int *num_stes)
+{
+	int alloc_size;
+
+	if (htbl->chunk->byte_size > dmn->send_ring->max_post_send_size) {
+		*iterations = htbl->chunk->byte_size /
+			dmn->send_ring->max_post_send_size;
+		*byte_size = dmn->send_ring->max_post_send_size;
+		alloc_size = *byte_size;
+		*num_stes = *byte_size / DR_STE_SIZE;
+	} else {
+		*iterations = 1;
+		*num_stes = htbl->chunk->num_of_entries;
+		alloc_size = *num_stes * DR_STE_SIZE;
+	}
+
+	*data = kzalloc(alloc_size, GFP_KERNEL);
+	if (!*data)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/**
+ * mlx5dr_send_postsend_ste: write size bytes at offset into the hw icm.
+ *
+ *     @dmn:    Domain
+ *     @ste:    The ste struct that contains the data (at
+ *              least part of it)
+ *     @data:   The data to send
+ *     @size:   Number of bytes to write
+ *     @offset: The offset from the icm mapped data at which
+ *              to start writing; this allows writing only part
+ *              of the buffer.
+ *
+ * Return: 0 on success.
+ */
+int mlx5dr_send_postsend_ste(struct mlx5dr_domain *dmn, struct mlx5dr_ste *ste,
+			     u8 *data, u16 size, u16 offset)
+{
+	struct postsend_info send_info = {};
+
+	send_info.write.addr = (uintptr_t)data;
+	send_info.write.length = size;
+	send_info.write.lkey = 0;
+	send_info.remote_addr = mlx5dr_ste_get_mr_addr(ste) + offset;
+	send_info.rkey = ste->htbl->chunk->rkey;
+
+	return dr_postsend_icm_data(dmn, &send_info);
+}
+
+int mlx5dr_send_postsend_htbl(struct mlx5dr_domain *dmn,
+			      struct mlx5dr_ste_htbl *htbl,
+			      u8 *formatted_ste, u8 *mask)
+{
+	u32 byte_size = htbl->chunk->byte_size;
+	int num_stes_per_iter;
+	int iterations;
+	u8 *data;
+	int ret;
+	int i;
+	int j;
+
+	ret = dr_get_tbl_copy_details(dmn, htbl, &data, &byte_size,
+				      &iterations, &num_stes_per_iter);
+	if (ret)
+		return ret;
+
+	/* Send the data 'iterations' times */
+	for (i = 0; i < iterations; i++) {
+		u32 ste_index = i * (byte_size / DR_STE_SIZE);
+		struct postsend_info send_info = {};
+
+		/* Copy all STEs to the data buffer;
+		 * the bit_mask needs to be added
+		 */
+		for (j = 0; j < num_stes_per_iter; j++) {
+			u8 *hw_ste = htbl->ste_arr[ste_index + j].hw_ste;
+			u32 ste_off = j * DR_STE_SIZE;
+
+			if (mlx5dr_ste_is_not_valid_entry(hw_ste)) {
+				memcpy(data + ste_off,
+				       formatted_ste, DR_STE_SIZE);
+			} else {
+				/* Copy data */
+				memcpy(data + ste_off,
+				       htbl->ste_arr[ste_index + j].hw_ste,
+				       DR_STE_SIZE_REDUCED);
+				/* Copy bit_mask */
+				memcpy(data + ste_off + DR_STE_SIZE_REDUCED,
+				       mask, DR_STE_SIZE_MASK);
+			}
+		}
+
+		send_info.write.addr = (uintptr_t)data;
+		send_info.write.length = byte_size;
+		send_info.write.lkey = 0;
+		send_info.remote_addr =
+			mlx5dr_ste_get_mr_addr(htbl->ste_arr + ste_index);
+		send_info.rkey = htbl->chunk->rkey;
+
+		ret = dr_postsend_icm_data(dmn, &send_info);
+		if (ret)
+			goto out_free;
+	}
+
+out_free:
+	kfree(data);
+	return ret;
+}
+
+/* Initialize htbl with default STEs */
+int mlx5dr_send_postsend_formatted_htbl(struct mlx5dr_domain *dmn,
+					struct mlx5dr_ste_htbl *htbl,
+					u8 *ste_init_data,
+					bool update_hw_ste)
+{
+	u32 byte_size = htbl->chunk->byte_size;
+	int iterations;
+	int num_stes;
+	u8 *data;
+	int ret;
+	int i;
+
+	ret = dr_get_tbl_copy_details(dmn, htbl, &data, &byte_size,
+				      &iterations, &num_stes);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < num_stes; i++) {
+		u8 *copy_dst;
+
+		/* Copy the same ste to the data buffer */
+		copy_dst = data + i * DR_STE_SIZE;
+		memcpy(copy_dst, ste_init_data, DR_STE_SIZE);
+
+		if (update_hw_ste) {
+			/* Copy the reduced ste to hash table ste_arr */
+			copy_dst = htbl->hw_ste_arr + i * DR_STE_SIZE_REDUCED;
+			memcpy(copy_dst, ste_init_data, DR_STE_SIZE_REDUCED);
+		}
+	}
+
+	/* Send the data 'iterations' times */
+	for (i = 0; i < iterations; i++) {
+		u32 ste_index = i * (byte_size / DR_STE_SIZE);
+		struct postsend_info send_info = {};
+
+		send_info.write.addr = (uintptr_t)data;
+		send_info.write.length = byte_size;
+		send_info.write.lkey = 0;
+		send_info.remote_addr =
+			mlx5dr_ste_get_mr_addr(htbl->ste_arr + ste_index);
+		send_info.rkey = htbl->chunk->rkey;
+
+		ret = dr_postsend_icm_data(dmn, &send_info);
+		if (ret)
+			goto out_free;
+	}
+
+out_free:
+	kfree(data);
+	return ret;
+}
+
+int mlx5dr_send_postsend_action(struct mlx5dr_domain *dmn,
+				struct mlx5dr_action *action)
+{
+	struct postsend_info send_info = {};
+	int ret;
+
+	send_info.write.addr = (uintptr_t)action->rewrite.data;
+	send_info.write.length = action->rewrite.chunk->byte_size;
+	send_info.write.lkey = 0;
+	send_info.remote_addr = action->rewrite.chunk->mr_addr;
+	send_info.rkey = action->rewrite.chunk->rkey;
+
+	mutex_lock(&dmn->mutex);
+	ret = dr_postsend_icm_data(dmn, &send_info);
+	mutex_unlock(&dmn->mutex);
+
+	return ret;
+}
+
+static int dr_modify_qp_rst2init(struct mlx5_core_dev *mdev,
+				 struct mlx5dr_qp *dr_qp,
+				 int port)
+{
+	u32 in[MLX5_ST_SZ_DW(rst2init_qp_in)] = {};
+	void *qpc;
+
+	qpc = MLX5_ADDR_OF(rst2init_qp_in, in, qpc);
+
+	MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, port);
+	MLX5_SET(qpc, qpc, pm_state, MLX5_QPC_PM_STATE_MIGRATED);
+	MLX5_SET(qpc, qpc, rre, 1);
+	MLX5_SET(qpc, qpc, rwe, 1);
+
+	return mlx5_core_qp_modify(mdev, MLX5_CMD_OP_RST2INIT_QP, 0, qpc,
+				   &dr_qp->mqp);
+}
+
+static int dr_cmd_modify_qp_rtr2rts(struct mlx5_core_dev *mdev,
+				    struct mlx5dr_qp *dr_qp,
+				    struct dr_qp_rts_attr *attr)
+{
+	u32 in[MLX5_ST_SZ_DW(rtr2rts_qp_in)] = {};
+	void *qpc;
+
+	qpc = MLX5_ADDR_OF(rtr2rts_qp_in, in, qpc);
+
+	MLX5_SET(rtr2rts_qp_in, in, qpn, dr_qp->mqp.qpn);
+
+	MLX5_SET(qpc, qpc, log_ack_req_freq, 0);
+	MLX5_SET(qpc, qpc, retry_count, attr->retry_cnt);
+	MLX5_SET(qpc, qpc, rnr_retry, attr->rnr_retry);
+
+	return mlx5_core_qp_modify(mdev, MLX5_CMD_OP_RTR2RTS_QP, 0, qpc,
+				   &dr_qp->mqp);
+}
+
+static int dr_cmd_modify_qp_init2rtr(struct mlx5_core_dev *mdev,
+				     struct mlx5dr_qp *dr_qp,
+				     struct dr_qp_rtr_attr *attr)
+{
+	u32 in[MLX5_ST_SZ_DW(init2rtr_qp_in)] = {};
+	void *qpc;
+
+	qpc = MLX5_ADDR_OF(init2rtr_qp_in, in, qpc);
+
+	MLX5_SET(init2rtr_qp_in, in, qpn, dr_qp->mqp.qpn);
+
+	MLX5_SET(qpc, qpc, mtu, attr->mtu);
+	MLX5_SET(qpc, qpc, log_msg_max, DR_CHUNK_SIZE_MAX - 1);
+	MLX5_SET(qpc, qpc, remote_qpn, attr->qp_num);
+	memcpy(MLX5_ADDR_OF(qpc, qpc, primary_address_path.rmac_47_32),
+	       attr->dgid_attr.mac, sizeof(attr->dgid_attr.mac));
+	memcpy(MLX5_ADDR_OF(qpc, qpc, primary_address_path.rgid_rip),
+	       attr->dgid_attr.gid, sizeof(attr->dgid_attr.gid));
+	MLX5_SET(qpc, qpc, primary_address_path.src_addr_index,
+		 attr->sgid_index);
+
+	if (attr->dgid_attr.roce_ver == MLX5_ROCE_VERSION_2)
+		MLX5_SET(qpc, qpc, primary_address_path.udp_sport,
+			 attr->udp_src_port);
+
+	MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, attr->port_num);
+	MLX5_SET(qpc, qpc, min_rnr_nak, 1);
+
+	return mlx5_core_qp_modify(mdev, MLX5_CMD_OP_INIT2RTR_QP, 0, qpc,
+				   &dr_qp->mqp);
+}
+
+static int dr_prepare_qp_to_rts(struct mlx5dr_domain *dmn)
+{
+	struct mlx5dr_qp *dr_qp = dmn->send_ring->qp;
+	struct dr_qp_rts_attr rts_attr = {};
+	struct dr_qp_rtr_attr rtr_attr = {};
+	enum ib_mtu mtu = IB_MTU_1024;
+	u16 gid_index = 0;
+	int port = 1;
+	int ret;
+
+	/* Init */
+	ret = dr_modify_qp_rst2init(dmn->mdev, dr_qp, port);
+	if (ret)
+		return ret;
+
+	/* RTR */
+	ret = mlx5dr_cmd_query_gid(dmn->mdev, port, gid_index, &rtr_attr.dgid_attr);
+	if (ret)
+		return ret;
+
+	rtr_attr.mtu		= mtu;
+	rtr_attr.qp_num		= dr_qp->mqp.qpn;
+	rtr_attr.min_rnr_timer	= 12;
+	rtr_attr.port_num	= port;
+	rtr_attr.sgid_index	= gid_index;
+	rtr_attr.udp_src_port	= dmn->info.caps.roce_min_src_udp;
+
+	ret = dr_cmd_modify_qp_init2rtr(dmn->mdev, dr_qp, &rtr_attr);
+	if (ret)
+		return ret;
+
+	/* RTS */
+	rts_attr.timeout	= 14;
+	rts_attr.retry_cnt	= 7;
+	rts_attr.rnr_retry	= 7;
+
+	ret = dr_cmd_modify_qp_rtr2rts(dmn->mdev, dr_qp, &rts_attr);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static void dr_cq_event(struct mlx5_core_cq *mcq,
+			enum mlx5_event event)
+{
+	pr_info("CQ event %u on CQ #%u\n", event, mcq->cqn);
+}
+
+static struct mlx5dr_cq *dr_create_cq(struct mlx5_core_dev *mdev,
+				      struct mlx5_uars_page *uar,
+				      size_t ncqe)
+{
+	u32 temp_cqc[MLX5_ST_SZ_DW(cqc)] = {};
+	u32 out[MLX5_ST_SZ_DW(create_cq_out)];
+	struct mlx5_wq_param wqp;
+	struct mlx5_cqe64 *cqe;
+	struct mlx5dr_cq *cq;
+	int inlen, err, eqn;
+	unsigned int irqn;
+	void *cqc, *in;
+	__be64 *pas;
+	u32 i;
+
+	cq = kzalloc(sizeof(*cq), GFP_KERNEL);
+	if (!cq)
+		return NULL;
+
+	ncqe = roundup_pow_of_two(ncqe);
+	MLX5_SET(cqc, temp_cqc, log_cq_size, ilog2(ncqe));
+
+	wqp.buf_numa_node = mdev->priv.numa_node;
+	wqp.db_numa_node = mdev->priv.numa_node;
+
+	err = mlx5_cqwq_create(mdev, &wqp, temp_cqc, &cq->wq,
+			       &cq->wq_ctrl);
+	if (err)
+		goto out;
+
+	for (i = 0; i < mlx5_cqwq_get_size(&cq->wq); i++) {
+		cqe = mlx5_cqwq_get_wqe(&cq->wq, i);
+		cqe->op_own = MLX5_CQE_INVALID << 4 | MLX5_CQE_OWNER_MASK;
+	}
+
+	inlen = MLX5_ST_SZ_BYTES(create_cq_in) +
+		sizeof(u64) * cq->wq_ctrl.buf.npages;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		goto err_cqwq;
+
+	err = mlx5_vector2eqn(mdev, smp_processor_id(), &eqn, &irqn);
+	if (err) {
+		kvfree(in);
+		goto err_cqwq;
+	}
+
+	cqc = MLX5_ADDR_OF(create_cq_in, in, cq_context);
+	MLX5_SET(cqc, cqc, log_cq_size, ilog2(ncqe));
+	MLX5_SET(cqc, cqc, c_eqn, eqn);
+	MLX5_SET(cqc, cqc, uar_page, uar->index);
+	MLX5_SET(cqc, cqc, log_page_size, cq->wq_ctrl.buf.page_shift -
+		 MLX5_ADAPTER_PAGE_SHIFT);
+	MLX5_SET64(cqc, cqc, dbr_addr, cq->wq_ctrl.db.dma);
+
+	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas);
+	mlx5_fill_page_frag_array(&cq->wq_ctrl.buf, pas);
+
+	cq->mcq.event = dr_cq_event;
+
+	err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out));
+	kvfree(in);
+
+	if (err)
+		goto err_cqwq;
+
+	cq->mcq.cqe_sz = 64;
+	cq->mcq.set_ci_db = cq->wq_ctrl.db.db;
+	cq->mcq.arm_db = cq->wq_ctrl.db.db + 1;
+	*cq->mcq.set_ci_db = 0;
+	*cq->mcq.arm_db = 0;
+	cq->mcq.vector = 0;
+	cq->mcq.irqn = irqn;
+	cq->mcq.uar = uar;
+
+	return cq;
+
+err_cqwq:
+	mlx5_wq_destroy(&cq->wq_ctrl);
+out:
+	kfree(cq);
+	return NULL;
+}
+
+static void dr_destroy_cq(struct mlx5_core_dev *mdev, struct mlx5dr_cq *cq)
+{
+	mlx5_core_destroy_cq(mdev, &cq->mcq);
+	mlx5_wq_destroy(&cq->wq_ctrl);
+	kfree(cq);
+}
+
+static int
+dr_create_mkey(struct mlx5_core_dev *mdev, u32 pdn, struct mlx5_core_mkey *mkey)
+{
+	u32 in[MLX5_ST_SZ_DW(create_mkey_in)] = {};
+	void *mkc;
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_PA);
+	MLX5_SET(mkc, mkc, a, 1);
+	MLX5_SET(mkc, mkc, rw, 1);
+	MLX5_SET(mkc, mkc, rr, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, lr, 1);
+
+	MLX5_SET(mkc, mkc, pd, pdn);
+	MLX5_SET(mkc, mkc, length64, 1);
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+
+	return mlx5_core_create_mkey(mdev, mkey, in, sizeof(in));
+}
+
+static struct mlx5dr_mr *dr_reg_mr(struct mlx5_core_dev *mdev,
+				   u32 pdn, void *buf, size_t size)
+{
+	struct mlx5dr_mr *mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	struct device *dma_device;
+	dma_addr_t dma_addr;
+	int err;
+
+	if (!mr)
+		return NULL;
+
+	dma_device = &mdev->pdev->dev;
+	dma_addr = dma_map_single(dma_device, buf, size,
+				  DMA_BIDIRECTIONAL);
+	err = dma_mapping_error(dma_device, dma_addr);
+	if (err) {
+		mlx5_core_warn(mdev, "Can't dma buf\n");
+		kfree(mr);
+		return NULL;
+	}
+
+	err = dr_create_mkey(mdev, pdn, &mr->mkey);
+	if (err) {
+		mlx5_core_warn(mdev, "Can't create mkey\n");
+		dma_unmap_single(dma_device, dma_addr, size,
+				 DMA_BIDIRECTIONAL);
+		kfree(mr);
+		return NULL;
+	}
+
+	mr->dma_addr = dma_addr;
+	mr->size = size;
+	mr->addr = buf;
+
+	return mr;
+}
+
+static void dr_dereg_mr(struct mlx5_core_dev *mdev, struct mlx5dr_mr *mr)
+{
+	mlx5_core_destroy_mkey(mdev, &mr->mkey);
+	dma_unmap_single(&mdev->pdev->dev, mr->dma_addr, mr->size,
+			 DMA_BIDIRECTIONAL);
+	kfree(mr);
+}
+
+int mlx5dr_send_ring_alloc(struct mlx5dr_domain *dmn)
+{
+	struct dr_qp_init_attr init_attr = {};
+	int cq_size;
+	int size;
+	int ret;
+
+	dmn->send_ring = kzalloc(sizeof(*dmn->send_ring), GFP_KERNEL);
+	if (!dmn->send_ring)
+		return -ENOMEM;
+
+	cq_size = QUEUE_SIZE + 1;
+	dmn->send_ring->cq = dr_create_cq(dmn->mdev, dmn->uar, cq_size);
+	if (!dmn->send_ring->cq) {
+		ret = -ENOMEM;
+		goto free_send_ring;
+	}
+
+	init_attr.cqn = dmn->send_ring->cq->mcq.cqn;
+	init_attr.pdn = dmn->pdn;
+	init_attr.uar = dmn->uar;
+	init_attr.max_send_wr = QUEUE_SIZE;
+
+	dmn->send_ring->qp = dr_create_rc_qp(dmn->mdev, &init_attr);
+	if (!dmn->send_ring->qp) {
+		ret = -ENOMEM;
+		goto clean_cq;
+	}
+
+	dmn->send_ring->cq->qp = dmn->send_ring->qp;
+
+	dmn->info.max_send_wr = QUEUE_SIZE;
+	dmn->info.max_inline_size = min(dmn->send_ring->qp->max_inline_data,
+					DR_STE_SIZE);
+
+	dmn->send_ring->signal_th = dmn->info.max_send_wr /
+		SIGNAL_PER_DIV_QUEUE;
+
+	/* Prepare qp to be used */
+	ret = dr_prepare_qp_to_rts(dmn);
+	if (ret)
+		goto clean_qp;
+
+	dmn->send_ring->max_post_send_size =
+		mlx5dr_icm_pool_chunk_size_to_byte(DR_CHUNK_SIZE_1K,
+						   DR_ICM_TYPE_STE);
+
+	/* Allocating the max size as a buffer for writing */
+	size = dmn->send_ring->signal_th * dmn->send_ring->max_post_send_size;
+	dmn->send_ring->buf = kzalloc(size, GFP_KERNEL);
+	if (!dmn->send_ring->buf) {
+		ret = -ENOMEM;
+		goto clean_qp;
+	}
+
+	memset(dmn->send_ring->buf, 0, size);
+	dmn->send_ring->buf_size = size;
+
+	dmn->send_ring->mr = dr_reg_mr(dmn->mdev,
+				       dmn->pdn, dmn->send_ring->buf, size);
+	if (!dmn->send_ring->mr) {
+		ret = -ENOMEM;
+		goto free_mem;
+	}
+
+	dmn->send_ring->sync_mr = dr_reg_mr(dmn->mdev,
+					    dmn->pdn, dmn->send_ring->sync_buff,
+					    MIN_READ_SYNC);
+	if (!dmn->send_ring->sync_mr) {
+		ret = -ENOMEM;
+		goto clean_mr;
+	}
+
+	return 0;
+
+clean_mr:
+	dr_dereg_mr(dmn->mdev, dmn->send_ring->mr);
+free_mem:
+	kfree(dmn->send_ring->buf);
+clean_qp:
+	dr_destroy_qp(dmn->mdev, dmn->send_ring->qp);
+clean_cq:
+	dr_destroy_cq(dmn->mdev, dmn->send_ring->cq);
+free_send_ring:
+	kfree(dmn->send_ring);
+
+	return ret;
+}
+
+void mlx5dr_send_ring_free(struct mlx5dr_domain *dmn,
+			   struct mlx5dr_send_ring *send_ring)
+{
+	dr_destroy_qp(dmn->mdev, send_ring->qp);
+	dr_destroy_cq(dmn->mdev, send_ring->cq);
+	dr_dereg_mr(dmn->mdev, send_ring->sync_mr);
+	dr_dereg_mr(dmn->mdev, send_ring->mr);
+	kfree(send_ring->buf);
+	kfree(send_ring);
+}
+
+int mlx5dr_send_ring_force_drain(struct mlx5dr_domain *dmn)
+{
+	struct mlx5dr_send_ring *send_ring = dmn->send_ring;
+	struct postsend_info send_info = {};
+	u8 data[DR_STE_SIZE];
+	int num_of_sends_req;
+	int ret;
+	int i;
+
+	/* Sending this many requests guarantees that the queue is drained */
+	num_of_sends_req = send_ring->signal_th * TH_NUMS_TO_DRAIN / 2;
+
+	/* Send dummy requests, forcing the last one to be signaled */
+	send_info.write.addr = (uintptr_t)data;
+	send_info.write.length = DR_STE_SIZE;
+	send_info.write.lkey = 0;
+	/* Using the sync_mr in order to write/read */
+	send_info.remote_addr = (uintptr_t)send_ring->sync_mr->addr;
+	send_info.rkey = send_ring->sync_mr->mkey.key;
+
+	for (i = 0; i < num_of_sends_req; i++) {
+		ret = dr_postsend_icm_data(dmn, &send_info);
+		if (ret)
+			return ret;
+	}
+
+	ret = dr_handle_pending_wc(dmn, send_ring);
+
+	return ret;
+}
-- 
2.21.0

