Linux RDMA and InfiniBand development

Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed

* Re: [PATCH] net/mlx4_en: Fix bpf_prog_add ref_cnt in mlx4
From: Zhiyi Sun @ 2016-11-09  9:45 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bblanco-uqk4Ao+rVK5Wk0Htik3J/w, Tariq Toukan, Yishai Hadas,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <5822E6DB.40204-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>

On Wed, Nov 09, 2016 at 10:05:31AM +0100, Daniel Borkmann wrote:
> On 11/09/2016 08:35 AM, Zhiyi Sun wrote:
> > There are rx_ring_num queues. Each queue will load xdp prog. So
> > bpf_prog_add() should add rx_ring_num to ref_cnt.
> > 
> > Signed-off-by: Zhiyi Sun <zhiyisun-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> 
> Your analysis looks incorrect to me. Please elaborate in more detail why
> you think current code is buggy ...
> 

Yes, you are correct. My patch is incorrect. It is not a bug.

> Call path is dev_change_xdp_fd(), which does bpf_prog_get_type() on the
> fd. This already takes a ref and only drops it in case of error. Thus
> in mlx4_xdp_set(), you only need priv->rx_ring_num - 1 refs for the rest
> of the rings, so that dropping refs from old_prog makes sure we release
> it again. Looks correct to me (maybe a comment would have helped there).
> 

I thought mlx4's code is incorrect because in mlx5's driver, function
mlx5e_xdp_set() calls a pair of bpf_prog_add/put, the number of add and
put to the refs are same. I didn't notice that one "add" has been called in its
calller. So, it seems that mlx5's code is incorrect, right?

> >   drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > index 12c99a2..d25e150 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > @@ -2650,7 +2650,7 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
> >   	 */
> >   	if (priv->xdp_ring_num == xdp_ring_num) {
> >   		if (prog) {
> > -			prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
> > +			prog = bpf_prog_add(prog, priv->rx_ring_num);
> >   			if (IS_ERR(prog))
> >   				return PTR_ERR(prog);
> >   		}
> > @@ -2680,7 +2680,7 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
> >   	}
> > 
> >   	if (prog) {
> > -		prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
> > +		prog = bpf_prog_add(prog, priv->rx_ring_num);
> >   		if (IS_ERR(prog))
> >   			return PTR_ERR(prog);
> >   	}
> > 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net/mlx4_en: Fix bpf_prog_add ref_cnt in mlx4
From: Daniel Borkmann @ 2016-11-09  9:57 UTC (permalink / raw)
  To: Zhiyi Sun
  Cc: bblanco, Tariq Toukan, Yishai Hadas, netdev, linux-rdma,
	linux-kernel, alexei.starovoitov
In-Reply-To: <20161109094546.jtmzc4xwtaavzcnt@ubuntu>

[-- Attachment #1: Type: text/plain, Size: 1365 bytes --]

On 11/09/2016 10:45 AM, Zhiyi Sun wrote:
> On Wed, Nov 09, 2016 at 10:05:31AM +0100, Daniel Borkmann wrote:
>> On 11/09/2016 08:35 AM, Zhiyi Sun wrote:
>>> There are rx_ring_num queues. Each queue will load xdp prog. So
>>> bpf_prog_add() should add rx_ring_num to ref_cnt.
>>>
>>> Signed-off-by: Zhiyi Sun <zhiyisun@gmail.com>
>>
>> Your analysis looks incorrect to me. Please elaborate in more detail why
>> you think current code is buggy ...
>
> Yes, you are correct. My patch is incorrect. It is not a bug.
>
>> Call path is dev_change_xdp_fd(), which does bpf_prog_get_type() on the
>> fd. This already takes a ref and only drops it in case of error. Thus
>> in mlx4_xdp_set(), you only need priv->rx_ring_num - 1 refs for the rest
>> of the rings, so that dropping refs from old_prog makes sure we release
>> it again. Looks correct to me (maybe a comment would have helped there).
>
> I thought mlx4's code is incorrect because in mlx5's driver, function
> mlx5e_xdp_set() calls a pair of bpf_prog_add/put, the number of add and
> put to the refs are same. I didn't notice that one "add" has been called in its
> calller. So, it seems that mlx5's code is incorrect, right?

Yep, I think the two attached patches are needed.

The other thing I noticed in mlx5e_create_rq() is that it calls
bpf_prog_add(rq->xdp_prog, 1) without actually checking for errors.

[-- Attachment #2: 0001-bpf-mlx4-fix-prog-refcount-in-mlx4_en_try_alloc_reso.patch --]
[-- Type: text/x-patch, Size: 3022 bytes --]

>From d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0 Mon Sep 17 00:00:00 2001
Message-Id: <d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0.1478685278.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 9 Nov 2016 10:31:19 +0100
Subject: [PATCH net-next 1/2] bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error path

Commit 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings
scheme") added a bug in that the prog's reference count is not dropped
in the error path when mlx4_en_try_alloc_resources() is failing.

We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
need to release again. Earlier in the call-path, dev_change_xdp_fd()
itself holds a ref to the prog as well, which is then released though
bpf_prog_put() due to the propagated error.

Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  5 ++++-
 include/linux/bpf.h                            |  1 +
 kernel/bpf/syscall.c                           | 11 +++++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 0f6225c..4104aec 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2747,8 +2747,11 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	}
 
 	err = mlx4_en_try_alloc_resources(priv, tmp, &new_prof);
-	if (err)
+	if (err) {
+		if (prog)
+			bpf_prog_add_undo(prog, priv->rx_ring_num - 1);
 		goto unlock_out;
+	}
 
 	if (priv->port_up) {
 		port_up = 1;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index edcd96d..4f6a4f1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -234,6 +234,7 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
 struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
+void bpf_prog_add_undo(struct bpf_prog *prog, int i);
 struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
 void bpf_prog_put(struct bpf_prog *prog);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..a6e4dd8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -680,6 +680,17 @@ struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_add);
 
+void bpf_prog_add_undo(struct bpf_prog *prog, int i)
+{
+	/* Only to be used for undoing previous bpf_prog_add() in some
+	 * error path. We still know that another entity in our call
+	 * path holds a reference to the program, thus atomic_sub() can
+	 * be safely used here!
+	 */
+	atomic_sub(i, &prog->aux->refcnt);
+}
+EXPORT_SYMBOL_GPL(bpf_prog_add_undo);
+
 struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
 {
 	return bpf_prog_add(prog, 1);
-- 
1.9.3


[-- Attachment #3: 0002-bpf-mlx5-fix-prog-refcount-in-mlx5e_xdp_set.patch --]
[-- Type: text/x-patch, Size: 1474 bytes --]

>From f0789544432bbb89c53c3b8ac6575d48fed97786 Mon Sep 17 00:00:00 2001
Message-Id: <f0789544432bbb89c53c3b8ac6575d48fed97786.1478685278.git.daniel@iogearbox.net>
In-Reply-To: <d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0.1478685278.git.daniel@iogearbox.net>
References: <d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0.1478685278.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 9 Nov 2016 10:51:26 +0100
Subject: [PATCH net-next 2/2] bpf, mlx5: fix prog refcount in mlx5e_xdp_set

dev_change_xdp_fd() already holds a reference, so bpf_prog_add(prog, 1)
is not correct as it takes one reference too much and will thus leak
the prog eventually. Also, bpf_prog_add() can fail and is not checked
for errors here.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ba0c774..63309dd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3121,8 +3121,6 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 
 	/* exchange programs */
 	old_prog = xchg(&priv->xdp_prog, prog);
-	if (prog)
-		bpf_prog_add(prog, 1);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
-- 
1.9.3


^ permalink raw reply related

* Re: [PATCH net-next 1/2] bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error path
From: kbuild test robot @ 2016-11-09 10:58 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: kbuild-all-JC7UmRfGjtg, Zhiyi Sun, bblanco-uqk4Ao+rVK5Wk0Htik3J/w,
	Tariq Toukan, Yishai Hadas, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <5822F30C.1050900-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1365 bytes --]

Hi Daniel,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Daniel-Borkmann/bpf-mlx4-fix-prog-refcount-in-mlx4_en_try_alloc_resources-error-path/20161109-182712
config: x86_64-acpi-redef (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/mellanox/mlx4/en_netdev.c: In function 'mlx4_xdp_set':
>> drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2752:4: error: implicit declaration of function 'bpf_prog_add_undo' [-Werror=implicit-function-declaration]
       bpf_prog_add_undo(prog, priv->rx_ring_num - 1);
       ^~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/bpf_prog_add_undo +2752 drivers/net/ethernet/mellanox/mlx4/en_netdev.c

  2746			en_warn(priv, "Reducing the number of TX rings, to not exceed the max total rings number.\n");
  2747		}
  2748	
  2749		err = mlx4_en_try_alloc_resources(priv, tmp, &new_prof);
  2750		if (err) {
  2751			if (prog)
> 2752				bpf_prog_add_undo(prog, priv->rx_ring_num - 1);
  2753			goto unlock_out;
  2754		}
  2755	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28646 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 1/2] bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error path
From: Daniel Borkmann @ 2016-11-09 11:04 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Zhiyi Sun, bblanco, Tariq Toukan, Yishai Hadas,
	netdev, linux-rdma, linux-kernel, alexei.starovoitov
In-Reply-To: <201611091853.HAp072gP%fengguang.wu@intel.com>

On 11/09/2016 11:58 AM, kbuild test robot wrote:
[...]
> All errors (new ones prefixed by >>):
>
>     drivers/net/ethernet/mellanox/mlx4/en_netdev.c: In function 'mlx4_xdp_set':
>>> drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2752:4: error: implicit declaration of function 'bpf_prog_add_undo' [-Werror=implicit-function-declaration]
>         bpf_prog_add_undo(prog, priv->rx_ring_num - 1);
>         ^~~~~~~~~~~~~~~~~
>     cc1: some warnings being treated as errors

Ahh right, needs an empty variant for !CONFIG_BPF_SYSCALL. I'll fix that up
before sending an official patch.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH rdma-core 1/7] libhns: Add initial main frame
From: oulijun @ 2016-11-09 13:10 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20161108125441.GB27883-2ukJVAZIZ/Y@public.gmane.org>

在 2016/11/8 20:54, Leon Romanovsky 写道:
> On Mon, Nov 07, 2016 at 04:15:32PM -0700, Jason Gunthorpe wrote:
>> On Sat, Oct 29, 2016 at 09:16:25AM +0800, oulijun wrote:
>>
>>> We hope that the only one userspace library file named
>>> libhns-rdmav2.so will be used for the different hardware
>>> version(hip06, hip07, ...), because there are only little change
>>> between their userspace drivers. So we need to distinguish hardware
>>> version.
>>
>> I guess that makes sense, but you still need to be able to parse dt
>> compatible strings that are lists.
> 
> IMHO, it can be easily done as follow up patches.
> 
Hi, Leon & Jason
   We hope that the only one userspace library file named libhns-rdmav2.so will be used for the different hardware version(hip06, hip07, ...),
because there are only little change between their userspace drivers. So we need to distinguish hardware version.
We can't distinguish them if only matching driver name "hns_roce".

It will be matched it when appeared the second hard version, the code will be fixed as follows:
 firstly, we will add a hca_table structure:
 static struct {
	unsigned int		vendor;
	unsigned int		device;
	void			*data;
	int			version;
} hca_table[] = {
	{PCI_VENDOR_ID_HISILICON, 0xA223, &hns_roce_u_hw_v2, HNS_ROCE_HW_VER2},
	{PCI_VENDOR_ID_HISILICON, 0xA224, &hns_roce_u_hw_v2, HNS_ROCE_HW_VER2},
	{PCI_VENDOR_ID_HISILICON, 0xA225, &hns_roce_u_hw_v2, HNS_ROCE_HW_VER2},
	{PCI_VENDOR_ID_HISILICON, 0xA226, &hns_roce_u_hw_v2, HNS_ROCE_HW_VER2},
	{PCI_VENDOR_ID_HISILICON, 0xA227, &hns_roce_u_hw_v2, HNS_ROCE_HW_VER2},
	{PCI_VENDOR_ID_HISILICON, 0xA22F, &hns_roce_u_hw_v2, HNS_ROCE_HW_VER2},
};

second, we will distinguish with it by hca_table[]:
   if (ibv_read_sysfs_file(uverbs_sys_path, "device/modalias",
				value, sizeof(value)) > 0)
		for (i = 0; i < sizeof(acpi_table) / sizeof(acpi_table[0]); ++i)
			if (!strcmp(value, acpi_table[i].hid)) {
				u_hw = acpi_table[i].data;
				hw_version = acpi_table[i].version;
				goto found;
			}

	if (ibv_read_sysfs_file(uverbs_sys_path, "device/of_node/compatible",
				value, sizeof(value)) > 0)
		for (i = 0; i < sizeof(dt_table) / sizeof(dt_table[0]); ++i)
			if (!strcmp(value, dt_table[i].compatible)) {
				u_hw = dt_table[i].data;
				hw_version = dt_table[i].version;
				goto found;
			}

	if (ibv_read_sysfs_file(uverbs_sys_path, "device/device", value,
				sizeof(value)) < 0)
		return NULL;

	sscanf(value, "%i", &vendor);

	if (ibv_read_sysfs_file(uverbs_sys_path, "device/vendor", value,
				sizeof(value)) < 0)
		return NULL;

	sscanf(value, "%i", &vendor);

	for (i = 0; i < sizeof(hca_table) / sizeof(hca_table[0]); ++i)
		if (vendor == hca_table[i].vendor &&
		    device == hca_table[i].device)
			goto found;

for using the path "device/of_node/compatible" when startup by DT method with hip06,
the content of compatible can only match with the device id and name in the hns-roce.ko:
static const struct of_device_id hns_roce_of_match[] = {
	{ .compatible = "hisilicon,hns-roce-v1", .data = &hns_roce_hw_v1, },
	{},
};
hence, we think that it will be distinguished by found the string("hisilicon, hns-roce-v1") in
/../../../device/of_node/compatible

When the userspace library of hns support hip07 or hip 08, the so file is still libhns-rdmav2.so and
will be used simultaneously for hip06 and hip07 or hip08 etc.

Lijun Ou
>>
>> Jason
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v4 9/9] selinux: Add a cache for quicker retreival of PKey SIDs
From: Daniel Jurgens @ 2016-11-09 14:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: chrisw-69jw2NvuJkxg9hUCZPvPmw@public.gmane.org,
	paul-r2n+y4ga6xFZroRs9YW3xA@public.gmane.org,
	sds-+05T5uksL2qpZYMLLGbcSA@public.gmane.org,
	eparis-FjpueFixGhCM4zKIHC2jIg@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	selinux-+05T5uksL2qpZYMLLGbcSA@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Yevgeny Petrilin, Liran Liss
In-Reply-To: <20161109070455.GF27883@leon.nu>

On 11/9/2016 1:05 AM, Leon Romanovsky wrote:
> On Tue, Nov 08, 2016 at 11:06:25PM +0200, Dan Jurgens wrote:
>> From: Daniel Jurgens <danielj-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> It is likely that the SID for the same PKey will be requested many
>> times. To reduce the time to modify QPs and process MADs use a cache to
>> store PKey SIDs.
>>
>> This code is heavily based on the "netif" and "netport" concept
>> originally developed by James Morris <jmorris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> and Paul Moore
>> <paul-r2n+y4ga6xFZroRs9YW3xA@public.gmane.org> (see security/selinux/netif.c and
>> security/selinux/netport.c for more information)
>>
>> issue: 736423
>> Change-Id: I176c3079d5d84d06839b4f750100ac47a6081e94
> It doesn't belong to commit message.
>
>> Signed-off-by: Daniel Jurgens <danielj-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Yes, sorry silly oversight on my part.  I will address for all patches in v5.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* wireshark's RPC-over-RDMA dissector
From: Chuck Lever @ 2016-11-09 16:05 UTC (permalink / raw)
  To: Linux NFS Mailing List, List Linux RDMA Mailing

Hi-

Thanks to Yan Berman, for a couple of years now we've had a basic
RPC-over-RDMA dissector in wireshark that can be used with ibdump
captures. There have been some bugs noted, but no-one has had the
cycles to dig in and address.

Recently Tom Haynes helped me set up my own wireshark build so I
could help address some of the known issues.

http://git.linux-nfs.org/?p=cel/wireshark.git;a=shortlog;h=refs/heads/rpc-rdma-fixes

Posting here for review before I take the next steps to push these
to the wireshark community. Constructive critique and other
suggestions are welcome.

The fixes so far focus on dissecting transport headers correctly.
There continue to be significant open issues with the dissector:
	• There does not appear to be any support for dissecting
	  RPC-over-RDMA on iWARP or RoCE
	• The NFS dissector does not handle portions of the XDR
	  stream that were transmitted via RDMA Read/Write
	• RPC messages conveyed via RDMA_NOMSG are not recognized
	  or dissected
	• There is no association between RDMA Reads and Writes
	  and the RPC-over-RDMA message they go with
	• A CREQ / CREP pair are needed to identify which QP
	  numbers are used for RPC-over-RDMA traffic
	• With TCP, the dissector fully outdents the RPC and NFS
	  dissection results; but with RDMA, the dissector places
	  the results in the tree under the Infiniband header
	• Not enough error detection in the dissector

--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: wireshark's RPC-over-RDMA dissector
From: Parav Pandit @ 2016-11-09 16:41 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS Mailing List, List Linux RDMA Mailing
In-Reply-To: <2975B49B-2696-4E2A-B9D0-2D6CB607EC59-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

Hi Chuck,

Just FYI.
I have committed few fixes in wireshark trunk for RoCE and IB
dissectors which will in general benefit other ULPs as well (primary
for statefulness of ULPs) last month.

I have few patches pending in my sandbox in area of RoCE and for other
ULP that I will push in coming days, likely next week.
I am currently actively testing them and have made steady progress so far.

I will try to find sometime to review them next week.

Regards,
Parav Pandit


On Wed, Nov 9, 2016 at 9:35 PM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> Hi-
>
> Thanks to Yan Berman, for a couple of years now we've had a basic
> RPC-over-RDMA dissector in wireshark that can be used with ibdump
> captures. There have been some bugs noted, but no-one has had the
> cycles to dig in and address.
>
> Recently Tom Haynes helped me set up my own wireshark build so I
> could help address some of the known issues.
>
> http://git.linux-nfs.org/?p=cel/wireshark.git;a=shortlog;h=refs/heads/rpc-rdma-fixes
>
> Posting here for review before I take the next steps to push these
> to the wireshark community. Constructive critique and other
> suggestions are welcome.
>
> The fixes so far focus on dissecting transport headers correctly.
> There continue to be significant open issues with the dissector:
>         • There does not appear to be any support for dissecting
>           RPC-over-RDMA on iWARP or RoCE
>         • The NFS dissector does not handle portions of the XDR
>           stream that were transmitted via RDMA Read/Write
>         • RPC messages conveyed via RDMA_NOMSG are not recognized
>           or dissected
>         • There is no association between RDMA Reads and Writes
>           and the RPC-over-RDMA message they go with
>         • A CREQ / CREP pair are needed to identify which QP
>           numbers are used for RPC-over-RDMA traffic
>         • With TCP, the dissector fully outdents the RPC and NFS
>           dissection results; but with RDMA, the dissector places
>           the results in the tree under the Infiniband header
>         • Not enough error detection in the dissector
>
> --
> Chuck Lever
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net/mlx4_en: Fix bpf_prog_add ref_cnt in mlx4
From: Brenden Blanco @ 2016-11-09 17:06 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Zhiyi Sun, Tariq Toukan, Yishai Hadas, netdev, linux-rdma,
	linux-kernel, alexei.starovoitov
In-Reply-To: <5822F30C.1050900@iogearbox.net>

On Wed, Nov 09, 2016 at 10:57:32AM +0100, Daniel Borkmann wrote:
> On 11/09/2016 10:45 AM, Zhiyi Sun wrote:
> >On Wed, Nov 09, 2016 at 10:05:31AM +0100, Daniel Borkmann wrote:
> >>On 11/09/2016 08:35 AM, Zhiyi Sun wrote:
> >>>There are rx_ring_num queues. Each queue will load xdp prog. So
> >>>bpf_prog_add() should add rx_ring_num to ref_cnt.
> >>>
> >>>Signed-off-by: Zhiyi Sun <zhiyisun@gmail.com>
> >>
> >>Your analysis looks incorrect to me. Please elaborate in more detail why
> >>you think current code is buggy ...
> >
> >Yes, you are correct. My patch is incorrect. It is not a bug.
> >
> >>Call path is dev_change_xdp_fd(), which does bpf_prog_get_type() on the
> >>fd. This already takes a ref and only drops it in case of error. Thus
> >>in mlx4_xdp_set(), you only need priv->rx_ring_num - 1 refs for the rest
> >>of the rings, so that dropping refs from old_prog makes sure we release
> >>it again. Looks correct to me (maybe a comment would have helped there).
> >
> >I thought mlx4's code is incorrect because in mlx5's driver, function
> >mlx5e_xdp_set() calls a pair of bpf_prog_add/put, the number of add and
> >put to the refs are same. I didn't notice that one "add" has been called in its
> >calller. So, it seems that mlx5's code is incorrect, right?
> 
> Yep, I think the two attached patches are needed.
> 
> The other thing I noticed in mlx5e_create_rq() is that it calls
> bpf_prog_add(rq->xdp_prog, 1) without actually checking for errors.

> From d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0 Mon Sep 17 00:00:00 2001
> Message-Id: <d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0.1478685278.git.daniel@iogearbox.net>
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Wed, 9 Nov 2016 10:31:19 +0100
> Subject: [PATCH net-next 1/2] bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error path
> 
> Commit 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings
> scheme") added a bug in that the prog's reference count is not dropped
> in the error path when mlx4_en_try_alloc_resources() is failing.
> 
> We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
> need to release again. Earlier in the call-path, dev_change_xdp_fd()
> itself holds a ref to the prog as well, which is then released though
> bpf_prog_put() due to the propagated error.
> 
> Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  5 ++++-
>  include/linux/bpf.h                            |  1 +
>  kernel/bpf/syscall.c                           | 11 +++++++++++
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 0f6225c..4104aec 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2747,8 +2747,11 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>  	}
>  
>  	err = mlx4_en_try_alloc_resources(priv, tmp, &new_prof);
> -	if (err)
> +	if (err) {
> +		if (prog)
> +			bpf_prog_add_undo(prog, priv->rx_ring_num - 1);
Why not just move the above bpf_prog_add to be below the try_alloc?
Nobody needs those references until all of the resources have been
allocated, and then we can remove the need for bpf_prog_add_undo.
>  		goto unlock_out;
> +	}
>  
>  	if (priv->port_up) {
>  		port_up = 1;
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index edcd96d..4f6a4f1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -234,6 +234,7 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
>  struct bpf_prog *bpf_prog_get(u32 ufd);
>  struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
>  struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
> +void bpf_prog_add_undo(struct bpf_prog *prog, int i);
>  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
>  void bpf_prog_put(struct bpf_prog *prog);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 228f962..a6e4dd8 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -680,6 +680,17 @@ struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
>  }
>  EXPORT_SYMBOL_GPL(bpf_prog_add);
>  
> +void bpf_prog_add_undo(struct bpf_prog *prog, int i)
> +{
> +	/* Only to be used for undoing previous bpf_prog_add() in some
> +	 * error path. We still know that another entity in our call
> +	 * path holds a reference to the program, thus atomic_sub() can
> +	 * be safely used here!
> +	 */
> +	atomic_sub(i, &prog->aux->refcnt);
> +}
> +EXPORT_SYMBOL_GPL(bpf_prog_add_undo);
> +
>  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
>  {
>  	return bpf_prog_add(prog, 1);
> -- 
> 1.9.3
> 

> From f0789544432bbb89c53c3b8ac6575d48fed97786 Mon Sep 17 00:00:00 2001
> Message-Id: <f0789544432bbb89c53c3b8ac6575d48fed97786.1478685278.git.daniel@iogearbox.net>
> In-Reply-To: <d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0.1478685278.git.daniel@iogearbox.net>
> References: <d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0.1478685278.git.daniel@iogearbox.net>
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Wed, 9 Nov 2016 10:51:26 +0100
> Subject: [PATCH net-next 2/2] bpf, mlx5: fix prog refcount in mlx5e_xdp_set
> 
> dev_change_xdp_fd() already holds a reference, so bpf_prog_add(prog, 1)
> is not correct as it takes one reference too much and will thus leak
> the prog eventually. Also, bpf_prog_add() can fail and is not checked
> for errors here.
> 
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index ba0c774..63309dd 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -3121,8 +3121,6 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
>  
>  	/* exchange programs */
>  	old_prog = xchg(&priv->xdp_prog, prog);
> -	if (prog)
> -		bpf_prog_add(prog, 1);
There is also another use of bpf_prog_add down below, which does not
check the error return. Same in mlx5e_create_rq.
>  	if (old_prog)
>  		bpf_prog_put(old_prog);
>  
> -- 
> 1.9.3
> 

^ permalink raw reply

* RE: [PATCH rdma-next 0/4] Add packet pacing support for IB verbs
From: Hefty, Sean @ 2016-11-09 17:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161109064009.GE27883-2ukJVAZIZ/Y@public.gmane.org>

> On Tue, Nov 08, 2016 at 05:49:26PM +0000, Hefty, Sean wrote:
> > > When sending from a 10G host to a 1G host, it is easy to overrun
> the
> > > receiver,
> > > leading to packet loss and traffic backing off. Similar problems
> occur
> > > when
> > > a 10G host sends data to a sub-10G virtual circuit, or a 40G host
> > > sending
> > > to a 10G host. Packet pacing could control packet injection rate
> and
> > > reduces
> > > network congestion to maximize throughput & minimize network
> latency.
> >
> > Why isn't the path record data and existing mechanisms sufficient to
> handle this?
> >
> 
> Packet pacing allows different combinations of traffic shaping: per-
> CPU,
> per-flow and their combinations with better and steady QoS requirements
> without involving subnet management.

The patch adds this as a QP attribute, and we already have a rate for that.  I still don't see why the standard mechanisms are insufficient or couldn't be adapted.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH rdma-next 2/4] IB/core: Support rate limit for packet pacing
From: Hefty, Sean @ 2016-11-09 17:27 UTC (permalink / raw)
  To: Leon Romanovsky, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Bodong Wang
In-Reply-To: <1477909297-14491-3-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

>  enum ib_qp_state {
> @@ -1151,6 +1152,7 @@ struct ib_qp_attr {
>  	u8			rnr_retry;
>  	u8			alt_port_num;
>  	u8			alt_timeout;
> +	u32			rate_limit;
>  };

We already have ib_qp_attr::ib_ah_attr::static_rate, and that accounts for both the primary and alternate paths.  We should not add a conflicting rate_limit field.  Either use static_rate as defined by the spec, or replace/update it, with corresponding changes to how it is used in conjunction with SM data.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v1 0/7] server-side NFS/RDMA patches proposed for v4.10
From: Chuck Lever @ 2016-11-09 17:33 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA

I'd like to propose these server-side changes for v4.10. They
include:

- Drop connection on GSS sequence window overflow
- Remove unnecessary spin lock in the svc_rdma_send path
- A number of minor clean-ups

Available in the "nfsd-rdma-for-4.10" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git


Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.10


Meanwhile, I've been working on converting the server-side RPC/RDMA
transport to use the new generic R/W API. The prototype for the
svc_rdma_sendto path works for some forms of the transport header,
but still has a few bugs. The svc_rdma_recvfrom path will be next,
but is an even larger task.

When this work is further along I will publish a topic branch.

---

Chuck Lever (7):
      svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
      svcauth_gss: Close connection when dropping an incoming message
      svcrdma: Renovate sendto chunk list parsing
      svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
      svcrdma: Remove DMA map accounting
      svcrdma: Remove svc_rdma_op_ctxt::wc_status
      svcrdma: Break up dprintk format in svc_rdma_accept()


 include/linux/sunrpc/svc_rdma.h            |    7 --
 net/sunrpc/auth_gss/svcauth_gss.c          |    2 -
 net/sunrpc/svc.c                           |   14 +++-
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    1 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c    |   19 +++++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c      |   99 +++++++++-------------------
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |   87 ++++++++-----------------
 7 files changed, 87 insertions(+), 142 deletions(-)

--
Chuck Lever
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v1 1/7] svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
From: Chuck Lever @ 2016-11-09 17:33 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161109165814.15735.47815.stgit-Hs+gFlyCn65vLzlybtyyYzGyq/o6K9yX@public.gmane.org>

Logic copied from xs_setup_bc_tcp().

Fixes: 39a9beab5acb ('rpc: share one xps between all backchannels')
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 20027f8..6035c5a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -359,6 +359,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 out_fail:
 	xprt_rdma_free_addresses(xprt);
 	args->bc_xprt->xpt_bc_xprt = NULL;
+	args->bc_xprt->xpt_bc_xps = NULL;
 	xprt_put(xprt);
 	xprt_free(xprt);
 	return ERR_PTR(-EINVAL);

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 2/7] svcauth_gss: Close connection when dropping an incoming message
From: Chuck Lever @ 2016-11-09 17:33 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161109165814.15735.47815.stgit-Hs+gFlyCn65vLzlybtyyYzGyq/o6K9yX@public.gmane.org>

S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message
whose sequence number lies outside the current window is dropped.
The rationale is:

  The reason for discarding requests silently is that the server
  is unable to determine if the duplicate or out of range request
  was due to a sequencing problem in the client, network, or the
  operating system, or due to some quirk in routing, or a replay
  attack by an intruder.  Discarding the request allows the client
  to recover after timing out, if indeed the duplication was
  unintentional or well intended.

However, clients may rely on the server dropping the connection to
indicate that a retransmit is needed. Without a connection reset, a
client can wait forever without retransmitting, and the workload
just stops dead. I've reproduced this behavior by running xfstests
generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i.

To address this issue, have the server close the connection when it
silently discards an incoming message due to a GSS sequence number
problem.

There are a few other places where the server will never reply.
Change those spots in a similar fashion.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/auth_gss/svcauth_gss.c |    2 +-
 net/sunrpc/svc.c                  |   14 +++++++++-----
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
index 45662d7..886e9d38 100644
--- a/net/sunrpc/auth_gss/svcauth_gss.c
+++ b/net/sunrpc/auth_gss/svcauth_gss.c
@@ -1548,7 +1548,7 @@ static void destroy_use_gss_proxy_proc_entry(struct net *net) {}
 	ret = SVC_COMPLETE;
 	goto out;
 drop:
-	ret = SVC_DROP;
+	ret = SVC_CLOSE;
 out:
 	if (rsci)
 		cache_put(&rsci->h, sn->rsc_cache);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 7c8070e..75f290b 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1155,8 +1155,7 @@ static __printf(2,3) void svc_printk(struct svc_rqst *rqstp, const char *fmt, ..
 	case SVC_DENIED:
 		goto err_bad_auth;
 	case SVC_CLOSE:
-		if (test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags))
-			svc_close_xprt(rqstp->rq_xprt);
+		goto close;
 	case SVC_DROP:
 		goto dropit;
 	case SVC_COMPLETE:
@@ -1246,7 +1245,7 @@ static __printf(2,3) void svc_printk(struct svc_rqst *rqstp, const char *fmt, ..

  sendit:
 	if (svc_authorise(rqstp))
-		goto dropit;
+		goto close;
 	return 1;		/* Caller can now send it */

  dropit:
@@ -1254,11 +1253,16 @@ static __printf(2,3) void svc_printk(struct svc_rqst *rqstp, const char *fmt, ..
 	dprintk("svc: svc_process dropit\n");
 	return 0;

+ close:
+	if (test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags))
+		svc_close_xprt(rqstp->rq_xprt);
+	dprintk("svc: svc_process close\n");
+	return 0;
+
 err_short_len:
 	svc_printk(rqstp, "short len %Zd, dropping request\n",
 			argv->iov_len);
-
-	goto dropit;			/* drop request */
+	goto close;

 err_bad_rpc:
 	serv->sv_stats->rpcbadfmt++;

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 3/7] svcrdma: Renovate sendto chunk list parsing
From: Chuck Lever @ 2016-11-09 17:34 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161109165814.15735.47815.stgit-Hs+gFlyCn65vLzlybtyyYzGyq/o6K9yX@public.gmane.org>

The current sendto code appears to support clients that provide only
one of a Read list, a Write list, or a Reply chunk. My reading of
that code is that it doesn't support the following cases:

 - Read list + Write list
 - Read list + Reply chunk
 - Write list + Reply chunk
 - Read list + Write list + Reply chunk

The protocol allows more than one Read or Write chunk in those
lists. Some clients do send a Read list and Reply chunk
simultaneously. NFSv4 WRITE uses a Read list for the data payload,
and a Reply chunk because the GETATTR result in the reply can
contain a large object like an ACL.

Generalize one of the sendto code paths needed to support all of
the above cases, and attempt to ensure that only one pass is done
through the RPC Call's transport header to gather chunk list
information for building the reply.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 include/linux/sunrpc/svc_rdma.h         |    2 -
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   14 +++++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c   |   92 ++++++++-----------------------
 3 files changed, 39 insertions(+), 69 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index cc3ae16..6aef63b 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -236,8 +236,6 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
 extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
 			    struct svc_rdma_req_map *, bool);
 extern int svc_rdma_sendto(struct svc_rqst *);
-extern struct rpcrdma_read_chunk *
-	svc_rdma_get_read_chunk(struct rpcrdma_msg *);
 extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
 				int);
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index ad1df97..5ac93cb 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -415,6 +415,20 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
 	return 1;
 }
 
+/* Returns the address of the first read chunk or <nul> if no read chunk
+ * is present
+ */
+struct rpcrdma_read_chunk *
+svc_rdma_get_read_chunk(struct rpcrdma_msg *rmsgp)
+{
+	struct rpcrdma_read_chunk *ch =
+		(struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
+
+	if (ch->rc_discrim == xdr_zero)
+		return NULL;
+	return ch;
+}
+
 static int rdma_read_chunks(struct svcxprt_rdma *xprt,
 			    struct rpcrdma_msg *rmsgp,
 			    struct svc_rqst *rqstp,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index f5a91ed..0a58d40 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -153,76 +153,35 @@ static dma_addr_t dma_map_xdr(struct svcxprt_rdma *xprt,
 	return dma_addr;
 }
 
-/* Returns the address of the first read chunk or <nul> if no read chunk
- * is present
+/* Parse the RPC Call's transport header.
  */
-struct rpcrdma_read_chunk *
-svc_rdma_get_read_chunk(struct rpcrdma_msg *rmsgp)
+static void svc_rdma_get_write_arrays(struct rpcrdma_msg *rmsgp,
+				      struct rpcrdma_write_array **write,
+				      struct rpcrdma_write_array **reply)
 {
-	struct rpcrdma_read_chunk *ch =
-		(struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
+	__be32 *p;
 
-	if (ch->rc_discrim == xdr_zero)
-		return NULL;
-	return ch;
-}
-
-/* Returns the address of the first read write array element or <nul>
- * if no write array list is present
- */
-static struct rpcrdma_write_array *
-svc_rdma_get_write_array(struct rpcrdma_msg *rmsgp)
-{
-	if (rmsgp->rm_body.rm_chunks[0] != xdr_zero ||
-	    rmsgp->rm_body.rm_chunks[1] == xdr_zero)
-		return NULL;
-	return (struct rpcrdma_write_array *)&rmsgp->rm_body.rm_chunks[1];
-}
+	p = (__be32 *)&rmsgp->rm_body.rm_chunks[0];
 
-/* Returns the address of the first reply array element or <nul> if no
- * reply array is present
- */
-static struct rpcrdma_write_array *
-svc_rdma_get_reply_array(struct rpcrdma_msg *rmsgp,
-			 struct rpcrdma_write_array *wr_ary)
-{
-	struct rpcrdma_read_chunk *rch;
-	struct rpcrdma_write_array *rp_ary;
+	/* Read list */
+	while (*p++ != xdr_zero)
+		p += 5;
 
-	/* XXX: Need to fix when reply chunk may occur with read list
-	 *	and/or write list.
-	 */
-	if (rmsgp->rm_body.rm_chunks[0] != xdr_zero ||
-	    rmsgp->rm_body.rm_chunks[1] != xdr_zero)
-		return NULL;
-
-	rch = svc_rdma_get_read_chunk(rmsgp);
-	if (rch) {
-		while (rch->rc_discrim != xdr_zero)
-			rch++;
-
-		/* The reply chunk follows an empty write array located
-		 * at 'rc_position' here. The reply array is at rc_target.
-		 */
-		rp_ary = (struct rpcrdma_write_array *)&rch->rc_target;
-		goto found_it;
-	}
-
-	if (wr_ary) {
-		int chunk = be32_to_cpu(wr_ary->wc_nchunks);
-
-		rp_ary = (struct rpcrdma_write_array *)
-			 &wr_ary->wc_array[chunk].wc_target.rs_length;
-		goto found_it;
+	/* Write list */
+	if (*p != xdr_zero) {
+		*write = (struct rpcrdma_write_array *)p;
+		while (*p++ != xdr_zero)
+			p += 1 + be32_to_cpu(*p) * 4;
+	} else {
+		*write = NULL;
+		p++;
 	}
 
-	/* No read list, no write list */
-	rp_ary = (struct rpcrdma_write_array *)&rmsgp->rm_body.rm_chunks[2];
-
- found_it:
-	if (rp_ary->wc_discrim == xdr_zero)
-		return NULL;
-	return rp_ary;
+	/* Reply chunk */
+	if (*p != xdr_zero)
+		*reply = (struct rpcrdma_write_array *)p;
+	else
+		*reply = NULL;
 }
 
 /* RPC-over-RDMA Version One private extension: Remote Invalidation.
@@ -244,8 +203,8 @@ static u32 svc_rdma_get_inv_rkey(struct rpcrdma_msg *rdma_argp,
 
 	inv_rkey = 0;
 
-	rd_ary = svc_rdma_get_read_chunk(rdma_argp);
-	if (rd_ary) {
+	rd_ary = (struct rpcrdma_read_chunk *)&rdma_argp->rm_body.rm_chunks[0];
+	if (rd_ary->rc_discrim != xdr_zero) {
 		inv_rkey = be32_to_cpu(rd_ary->rc_target.rs_handle);
 		goto out;
 	}
@@ -622,8 +581,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 	 * places this at the start of page 0.
 	 */
 	rdma_argp = page_address(rqstp->rq_pages[0]);
-	wr_ary = svc_rdma_get_write_array(rdma_argp);
-	rp_ary = svc_rdma_get_reply_array(rdma_argp, wr_ary);
+	svc_rdma_get_write_arrays(rdma_argp, &wr_ary, &rp_ary);
 
 	inv_rkey = 0;
 	if (rdma->sc_snd_w_inv)

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 4/7] svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
From: Chuck Lever @ 2016-11-09 17:34 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161109165814.15735.47815.stgit-Hs+gFlyCn65vLzlybtyyYzGyq/o6K9yX@public.gmane.org>

svcrdma's current SQ accounting algorithm takes sc_lock and disables
bottom-halves while posting all RDMA Read, Write, and Send WRs.

This is relatively heavyweight serialization. And note that Write and
Send are already fully serialized by the xpt_mutex.

Using a single atomic_t should be all that is necessary to guarantee
that ib_post_send() is called only when there is enough space on the
send queue. This is what the other RDMA-enabled storage targets do.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 include/linux/sunrpc/svc_rdma.h          |    2 +-
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    |    7 ++++++-
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   18 +++++++-----------
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 6aef63b..601cb07 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -139,7 +139,7 @@ struct svcxprt_rdma {
 	int                  sc_max_sge_rd;	/* max sge for read target */
 	bool		     sc_snd_w_inv;	/* OK to use Send With Invalidate */
 
-	atomic_t             sc_sq_count;	/* Number of SQ WR on queue */
+	atomic_t             sc_sq_avail;	/* SQEs ready to be consumed */
 	unsigned int	     sc_sq_depth;	/* Depth of SQ */
 	unsigned int	     sc_rq_depth;	/* Depth of RQ */
 	u32		     sc_max_requests;	/* Forward credits */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 0a58d40..30eeab5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -594,7 +594,12 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 		goto err0;
 	inline_bytes = rqstp->rq_res.len;
 
-	/* Create the RDMA response header */
+	/* Create the RDMA response header. xprt->xpt_mutex,
+	 * acquired in svc_send(), serializes RPC replies. The
+	 * code path below that inserts the credit grant value
+	 * into each transport header runs only inside this
+	 * critical section.
+	 */
 	ret = -ENOMEM;
 	res_page = alloc_page(GFP_KERNEL);
 	if (!res_page)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 6864fb9..f7dd6494 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -434,7 +434,7 @@ static void svc_rdma_send_wc_common(struct svcxprt_rdma *xprt,
 		goto err;
 
 out:
-	atomic_dec(&xprt->sc_sq_count);
+	atomic_inc(&xprt->sc_sq_avail);
 	wake_up(&xprt->sc_send_wait);
 	return;
 
@@ -1008,6 +1008,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	newxprt->sc_rq_depth = newxprt->sc_max_requests +
 			       newxprt->sc_max_bc_requests;
 	newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_rq_depth;
+	atomic_set(&newxprt->sc_sq_avail, newxprt->sc_sq_depth);
 
 	if (!svc_rdma_prealloc_ctxts(newxprt))
 		goto errout;
@@ -1333,15 +1334,13 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
 
 	/* If the SQ is full, wait until an SQ entry is available */
 	while (1) {
-		spin_lock_bh(&xprt->sc_lock);
-		if (xprt->sc_sq_depth < atomic_read(&xprt->sc_sq_count) + wr_count) {
-			spin_unlock_bh(&xprt->sc_lock);
+		if ((atomic_sub_return(wr_count, &xprt->sc_sq_avail) < 0)) {
 			atomic_inc(&rdma_stat_sq_starve);
 
 			/* Wait until SQ WR available if SQ still full */
+			atomic_add(wr_count, &xprt->sc_sq_avail);
 			wait_event(xprt->sc_send_wait,
-				   atomic_read(&xprt->sc_sq_count) <
-				   xprt->sc_sq_depth);
+				   atomic_read(&xprt->sc_sq_avail) > wr_count);
 			if (test_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags))
 				return -ENOTCONN;
 			continue;
@@ -1351,19 +1350,16 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
 			svc_xprt_get(&xprt->sc_xprt);
 
 		/* Bump used SQ WR count and post */
-		atomic_add(wr_count, &xprt->sc_sq_count);
 		ret = ib_post_send(xprt->sc_qp, wr, &bad_wr);
 		if (ret) {
 			set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
-			atomic_sub(wr_count, &xprt->sc_sq_count);
 			for (i = 0; i < wr_count; i ++)
 				svc_xprt_put(&xprt->sc_xprt);
 			dprintk("svcrdma: failed to post SQ WR rc=%d, "
-			       "sc_sq_count=%d, sc_sq_depth=%d\n",
-			       ret, atomic_read(&xprt->sc_sq_count),
+			       "sc_sq_avail=%d, sc_sq_depth=%d\n",
+			       ret, atomic_read(&xprt->sc_sq_avail),
 			       xprt->sc_sq_depth);
 		}
-		spin_unlock_bh(&xprt->sc_lock);
 		if (ret)
 			wake_up(&xprt->sc_send_wait);
 		break;

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 5/7] svcrdma: Remove DMA map accounting
From: Chuck Lever @ 2016-11-09 17:34 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161109165814.15735.47815.stgit-Hs+gFlyCn65vLzlybtyyYzGyq/o6K9yX@public.gmane.org>

Clean up: sc_dma_used is not required for correct operation. It is
simply a debugging tool to report when svcrdma has leaked DMA maps.

However, manipulating an atomic has a measurable CPU cost, and DMA
map accounting specific to svcrdma will be meaningless once svcrdma
is converted to use the new generic r/w API.

A similar kind of debug accounting can be done simply by enabling
the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
CONFIG_IOMMU_LEAK.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 include/linux/sunrpc/svc_rdma.h          |    2 --
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |    1 -
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   13 +++----------
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 601cb07..43d7c70 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -148,7 +148,6 @@ struct svcxprt_rdma {
 
 	struct ib_pd         *sc_pd;
 
-	atomic_t	     sc_dma_used;
 	spinlock_t	     sc_ctxt_lock;
 	struct list_head     sc_ctxts;
 	int		     sc_ctxt_used;
@@ -200,7 +199,6 @@ static inline void svc_rdma_count_mappings(struct svcxprt_rdma *rdma,
 					   struct svc_rdma_op_ctxt *ctxt)
 {
 	ctxt->mapped_sges++;
-	atomic_inc(&rdma->sc_dma_used);
 }
 
 /* svc_rdma_backchannel.c */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 5ac93cb..bf18811 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -279,7 +279,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
 		       frmr->sg);
 		return -ENOMEM;
 	}
-	atomic_inc(&xprt->sc_dma_used);
 
 	n = ib_map_mr_sg(frmr->mr, frmr->sg, frmr->sg_nents, NULL, PAGE_SIZE);
 	if (unlikely(n != frmr->sg_nents)) {
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f7dd6494..ceff814 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -224,25 +224,22 @@ void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
 	struct svcxprt_rdma *xprt = ctxt->xprt;
 	struct ib_device *device = xprt->sc_cm_id->device;
 	u32 lkey = xprt->sc_pd->local_dma_lkey;
-	unsigned int i, count;
+	unsigned int i;
 
-	for (count = 0, i = 0; i < ctxt->mapped_sges; i++) {
+	for (i = 0; i < ctxt->mapped_sges; i++) {
 		/*
 		 * Unmap the DMA addr in the SGE if the lkey matches
 		 * the local_dma_lkey, otherwise, ignore it since it is
 		 * an FRMR lkey and will be unmapped later when the
 		 * last WR that uses it completes.
 		 */
-		if (ctxt->sge[i].lkey == lkey) {
-			count++;
+		if (ctxt->sge[i].lkey == lkey)
 			ib_dma_unmap_page(device,
 					    ctxt->sge[i].addr,
 					    ctxt->sge[i].length,
 					    ctxt->direction);
-		}
 	}
 	ctxt->mapped_sges = 0;
-	atomic_sub(count, &xprt->sc_dma_used);
 }
 
 void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
@@ -944,7 +941,6 @@ void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
 	if (frmr) {
 		ib_dma_unmap_sg(rdma->sc_cm_id->device,
 				frmr->sg, frmr->sg_nents, frmr->direction);
-		atomic_dec(&rdma->sc_dma_used);
 		spin_lock_bh(&rdma->sc_frmr_q_lock);
 		WARN_ON_ONCE(!list_empty(&frmr->frmr_list));
 		list_add(&frmr->frmr_list, &rdma->sc_frmr_q);
@@ -1256,9 +1252,6 @@ static void __svc_rdma_free(struct work_struct *work)
 	if (rdma->sc_ctxt_used != 0)
 		pr_err("svcrdma: ctxt still in use? (%d)\n",
 		       rdma->sc_ctxt_used);
-	if (atomic_read(&rdma->sc_dma_used) != 0)
-		pr_err("svcrdma: dma still in use? (%d)\n",
-		       atomic_read(&rdma->sc_dma_used));
 
 	/* Final put of backchannel client transport */
 	if (xprt->xpt_bc_xprt) {

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 6/7] svcrdma: Remove svc_rdma_op_ctxt::wc_status
From: Chuck Lever @ 2016-11-09 17:34 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161109165814.15735.47815.stgit-Hs+gFlyCn65vLzlybtyyYzGyq/o6K9yX@public.gmane.org>

Clean up: Completion status is already reported in the individual
completion handlers. Save a few bytes in struct svc_rdma_op_ctxt.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 include/linux/sunrpc/svc_rdma.h          |    1 -
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |    4 ++--
 net/sunrpc/xprtrdma/svc_rdma_transport.c |    1 -
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 43d7c70..757fb96 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -79,7 +79,6 @@ struct svc_rdma_op_ctxt {
 	struct ib_cqe reg_cqe;
 	struct ib_cqe inv_cqe;
 	struct list_head dto_q;
-	enum ib_wc_status wc_status;
 	u32 byte_len;
 	u32 position;
 	struct svcxprt_rdma *xprt;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index bf18811..02215d6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -640,8 +640,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 			goto defer;
 		goto out;
 	}
-	dprintk("svcrdma: processing ctxt=%p on xprt=%p, rqstp=%p, status=%d\n",
-		ctxt, rdma_xprt, rqstp, ctxt->wc_status);
+	dprintk("svcrdma: processing ctxt=%p on xprt=%p, rqstp=%p\n",
+		ctxt, rdma_xprt, rqstp);
 	atomic_inc(&rdma_stat_recv);
 
 	/* Build up the XDR from the receive buffers. */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index ceff814..b7c9359 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -393,7 +393,6 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
 
 	/* WARNING: Only wc->wr_cqe and wc->status are reliable */
 	ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
-	ctxt->wc_status = wc->status;
 	svc_rdma_unmap_dma(ctxt);
 
 	if (wc->status != IB_WC_SUCCESS)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 7/7] svcrdma: Break up dprintk format in svc_rdma_accept()
From: Chuck Lever @ 2016-11-09 17:34 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161109165814.15735.47815.stgit-Hs+gFlyCn65vLzlybtyyYzGyq/o6K9yX@public.gmane.org>

The current code results in:

Nov  7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800, newxprt->sc_pd=ffff880852a7ce00#012    cm_id->device=ffff88084dd20000, sc_pd->device=ffff88084dd20000#012    cap.max_send_wr = 272#012    cap.max_recv_wr = 34#012    cap.max_send_sge = 32#012    cap.max_recv_sge = 32
Nov  7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000 accepted with the following attributes:#012    local_ip        : 10.0.0.5#012    local_port#011     : 20049#012    remote_ip       : 10.0.0.2#012    remote_port     : 59909#012    max_sge         : 32#012    max_sge_rd      : 30#012    sq_depth        : 272#012    max_requests    : 32#012    ord             : 16

Split up the output over multiple dprintks and take the opportunity
to fix the display of IPv6 addresses.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   55 ++++++++++--------------------
 1 file changed, 18 insertions(+), 37 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b7c9359..8efe452 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -41,6 +41,7 @@
  */
 
 #include <linux/sunrpc/svc_xprt.h>
+#include <linux/sunrpc/addr.h>
 #include <linux/sunrpc/debug.h>
 #include <linux/sunrpc/rpc_rdma.h>
 #include <linux/interrupt.h>
@@ -966,6 +967,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	struct rpcrdma_connect_private pmsg;
 	struct ib_qp_init_attr qp_attr;
 	struct ib_device *dev;
+	struct sockaddr *sap;
 	unsigned int i;
 	int ret = 0;
 
@@ -1046,18 +1048,12 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	qp_attr.qp_type = IB_QPT_RC;
 	qp_attr.send_cq = newxprt->sc_sq_cq;
 	qp_attr.recv_cq = newxprt->sc_rq_cq;
-	dprintk("svcrdma: newxprt->sc_cm_id=%p, newxprt->sc_pd=%p\n"
-		"    cm_id->device=%p, sc_pd->device=%p\n"
-		"    cap.max_send_wr = %d\n"
-		"    cap.max_recv_wr = %d\n"
-		"    cap.max_send_sge = %d\n"
-		"    cap.max_recv_sge = %d\n",
-		newxprt->sc_cm_id, newxprt->sc_pd,
-		dev, newxprt->sc_pd->device,
-		qp_attr.cap.max_send_wr,
-		qp_attr.cap.max_recv_wr,
-		qp_attr.cap.max_send_sge,
-		qp_attr.cap.max_recv_sge);
+	dprintk("svcrdma: newxprt->sc_cm_id=%p, newxprt->sc_pd=%p\n",
+		newxprt->sc_cm_id, newxprt->sc_pd);
+	dprintk("    cap.max_send_wr = %d, cap.max_recv_wr = %d\n",
+		qp_attr.cap.max_send_wr, qp_attr.cap.max_recv_wr);
+	dprintk("    cap.max_send_sge = %d, cap.max_recv_sge = %d\n",
+		qp_attr.cap.max_send_sge, qp_attr.cap.max_recv_sge);
 
 	ret = rdma_create_qp(newxprt->sc_cm_id, newxprt->sc_pd, &qp_attr);
 	if (ret) {
@@ -1140,31 +1136,16 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		goto errout;
 	}
 
-	dprintk("svcrdma: new connection %p accepted with the following "
-		"attributes:\n"
-		"    local_ip        : %pI4\n"
-		"    local_port	     : %d\n"
-		"    remote_ip       : %pI4\n"
-		"    remote_port     : %d\n"
-		"    max_sge         : %d\n"
-		"    max_sge_rd      : %d\n"
-		"    sq_depth        : %d\n"
-		"    max_requests    : %d\n"
-		"    ord             : %d\n",
-		newxprt,
-		&((struct sockaddr_in *)&newxprt->sc_cm_id->
-			 route.addr.src_addr)->sin_addr.s_addr,
-		ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
-		       route.addr.src_addr)->sin_port),
-		&((struct sockaddr_in *)&newxprt->sc_cm_id->
-			 route.addr.dst_addr)->sin_addr.s_addr,
-		ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
-		       route.addr.dst_addr)->sin_port),
-		newxprt->sc_max_sge,
-		newxprt->sc_max_sge_rd,
-		newxprt->sc_sq_depth,
-		newxprt->sc_max_requests,
-		newxprt->sc_ord);
+	dprintk("svcrdma: new connection %p accepted:\n", newxprt);
+	sap = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
+	dprintk("    local address   : %pIS:%u\n", sap, rpc_get_port(sap));
+	sap = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
+	dprintk("    remote address  : %pIS:%u\n", sap, rpc_get_port(sap));
+	dprintk("    max_sge         : %d\n", newxprt->sc_max_sge);
+	dprintk("    max_sge_rd      : %d\n", newxprt->sc_max_sge_rd);
+	dprintk("    sq_depth        : %d\n", newxprt->sc_sq_depth);
+	dprintk("    max_requests    : %d\n", newxprt->sc_max_requests);
+	dprintk("    ord             : %d\n", newxprt->sc_ord);
 
 	return &newxprt->sc_xprt;
 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH rdma-core 0/8] libpvrdma: userspace library for PVRDMA
From: Adit Ranadive @ 2016-11-09 17:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, pv-drivers
In-Reply-To: <20161109011715.GA29310-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>

On Tue, Nov 08, 2016 at 5:17:15PM -0800, Christoph Hellwig wrote:
> FYI, the convention used in scsi is vmw_pvscsi.c, so naming the
> RDMA equivalent vmw_pvrdma would make a lot of sense and still
> be reasonably short.

Thanks Christoph. We agree with that as well. Then the driver path
(drivers/infiniband/hw/vmw_pvrdma/...) and the module name would
reflect vmw_pvrdma.
What about the user library? libpvrdma -> libvmwpvrdma?

Thanks,
Adit

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH TRIVIAL infiniband-diags] libibnetdisc/internal.h: Remove duplicated declaration of add_to_portlid_hash
From: Hal Rosenstock @ 2016-11-09 17:48 UTC (permalink / raw)
  To: Weiny, Ira; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org


Signed-off-by: Hal Rosenstock <hal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
diff --git a/libibnetdisc/src/internal.h b/libibnetdisc/src/internal.h
index 5a32cd2..997f439 100644
--- a/libibnetdisc/src/internal.h
+++ b/libibnetdisc/src/internal.h
@@ -109,7 +109,6 @@ void smp_engine_destroy(smp_engine_t * engine);
 int add_to_nodeguid_hash(ibnd_node_t * node, ibnd_node_t * hash[]);
 
 int add_to_portguid_hash(ibnd_port_t * port, ibnd_port_t * hash[]);
-void add_to_portlid_hash(ibnd_port_t * port, GHashTable *htable);
 
 void add_to_type_list(ibnd_node_t * node, f_internal_t * fabric);
 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH rdma-core 0/8] libpvrdma: userspace library for PVRDMA
From: Jason Gunthorpe @ 2016-11-09 17:51 UTC (permalink / raw)
  To: Adit Ranadive
  Cc: Christoph Hellwig, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, pv-drivers
In-Reply-To: <b49161b1-75f6-6902-fced-355804b885bb-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>

On Wed, Nov 09, 2016 at 09:39:49AM -0800, Adit Ranadive wrote:
> On Tue, Nov 08, 2016 at 5:17:15PM -0800, Christoph Hellwig wrote:
> > FYI, the convention used in scsi is vmw_pvscsi.c, so naming the
> > RDMA equivalent vmw_pvrdma would make a lot of sense and still
> > be reasonably short.
> 
> Thanks Christoph. We agree with that as well. Then the driver path
> (drivers/infiniband/hw/vmw_pvrdma/...) and the module name would
> reflect vmw_pvrdma.
> What about the user library? libpvrdma -> libvmwpvrdma?

I am encouraging people to match the name of the module in the
library.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH rdma-core 0/8] libpvrdma: userspace library for PVRDMA
From: Jason Gunthorpe @ 2016-11-09 17:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Adit Ranadive, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, pv-drivers
In-Reply-To: <20161109012335.GA29658-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>

On Tue, Nov 08, 2016 at 05:23:35PM -0800, Christoph Hellwig wrote:
> On Sat, Nov 05, 2016 at 09:01:36AM -0600, Jason Gunthorpe wrote:
> > > Not entirely. You want me to keep the ABI file from the kernel in the
> > > fix up folder and also keep a file with the modified structs in
> > > providers/pvrdma?
> > 
> > Yes, and the file with the modified structs should include the kernel
> > header and duplicate it minimally. This will make it simpler for us to
> > eventually get rid of it.
> 
> Can we just automate generating the user header, e.g. have a sed script
> that recognized a magic comments ala
> 
> 	/* LIBIBVERBS PREAMBLE (DO NOT REMOVE) */
> 
> and just insers the needed fields?

Yah, something like that is a maybe, but if the drivers are going to
be small like pvrdma, I think we are OK with this approach.. I haven't
had time to audit what is needed beyond knowing that the uverbs header
is actually quite big.

Adit, make sure you push your patch to github too..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH rdma-core 0/8] libpvrdma: userspace library for PVRDMA
From: Adit Ranadive @ 2016-11-09 17:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, pv-drivers
In-Reply-To: <20161109175606.GB13467-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Wed, Nov 09, 2016 at 10:56:06AM -0700, Jason Gunthorpe wrote:
> 
> Adit, make sure you push your patch to github too..
> 
> Jason

Wasnt sure if other folks wanted to take a look at it but I will push it.

Thanks,
Adit 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [RFC ABI V5 02/10] RDMA/core: Add support for custom types
From: Hefty, Sean @ 2016-11-09 18:00 UTC (permalink / raw)
  To: Matan Barak
  Cc: Matan Barak, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Doug Ledford, Jason Gunthorpe, Christoph Lameter, Liran Liss,
	Haggai Eran, Majd Dibbiny, Tal Alon, Leon Romanovsky
In-Reply-To: <CAAKD3BDWyb10baLrDu=m_mYPB64i9OOPEPVYKtDo9zVbvMM-UA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 47059 bytes --]

> I had thought about that, but the user could initialize its part of
> the object in the function handler. It can't allocate the object as we
> need it in order to allocate an IDR entry and co. The assumption here
> is that the "unlock" stage can't fail.

This is creating a generic OO type of framework, so just add constructor/destructor functions and have all objects inherit from a base ioctl object class.

> > In fact, it would be great if we could just cleanup the list in the
> reverse order that items were created.  Maybe this requires supporting
> a pre-cleanup handler, so that the driver can pluck items out of the
> list that may need to be destroyed out of order.
> >
> 
> So that's essentially one layer of ordering. Why do you consider a
> driver iterating over all objects simpler than this model?

This problem is a verbs specific issue, and one that only involves MW .  We have reference counts that can provide the same functionality.  I want to avoid the amount of meta-data that needs to be used to describe objects.

> >> Adding an object is done in two parts.
> >> First, an object is allocated and added to IDR/fd table. Then, the
> >> command's handlers (in downstream patches) could work on this object
> >> and fill in its required details.
> >> After a successful command, ib_uverbs_uobject_enable is called and
> >> this user objects becomes ucontext visible.
> >
> > If you have a way to mark that an object is used for exclusive
> access, you may be able to use that instead of introducing a new
> variable.  (I.e. acquire the object's write lock).  I think we want to
> make an effort to minimize the size of the kernel structure needed to
> track every user space object (within reason).
> >
> 
> I didn't really follow. A command attribute states the nature of the
> locking (for example, in MODIFY_QP the QP could be exclusively locked,
> but in QUERY_QP it's only locked for reading). I don't want to really
> grab a lock, as if I were I could face a dead-lock (user-space could
> pass parameters in a colliding order), It could be solved by sorting
> the handles, but that would degrade performance without a good reasob.

I'm suggesting that the locking attribute and command be separate.  This allows the framework to acquire the proper type of lock independent of what function it will invoke.

The framework doesn't need to hold locks.  It should be able to mark access to an object.  If that access is not available, it can abort.  This pushes more complex synchronization and thread handling to user space.

> >> Removing an uboject is done by calling ib_uverbs_uobject_remove.
> >>
> >> We should make sure IDR (per-device) and list (per-ucontext) could
> >> be accessed concurrently without corrupting them.
> >>
> >> Signed-off-by: Matan Barak <matanb@mellanox.com>
> >> Signed-off-by: Haggai Eran <haggaie@mellanox.com>
> >> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> >> ---
> >
> > As a general comment, I do have concerns that the resulting
> generalized parsing of everything will negatively impact performance
> for operations that do have to transition into the kernel.  Not all
> devices offload all operations to user space.  Plus the resulting code
> is extremely difficult to read and non-trivial to use.  It's equivalent
> to reading C++ code that has 4 layers of inheritance with overrides to
> basic operators...
> 
> There are two parts here. I think the handlers themselves are simpler,
> easier to read and less error-prone. They contain less code
> duplications. The macro based define language explicitly declare all
> attributes, their types, size, etc.
> The model here is a bit more complex as we want to achieve both code
> resue and add/override of new types/actions/attributes.
> 
> 
> >
> > Pre and post operators per command that can do straightforward
> validation seem like a better option.
> >
> >
> 
> I think that would duplicate a lot of code and will be more
> error-prone than one infrastrucutre that automates all that work for
> you.

I think that's a toss-up.  Either you have to write the code correctly or write the rules correctly.  Reading code is straightforward, manually converting rules into code is not.

In any case, the two approaches are not exclusive.  By forcing the rule language into the framework, everything is forced to deal with it.  By leaving it out, each ioctl provider can decide if they need this or not.  If you want verbs to process all ioctl's using a single pre-validation function that operates based on these rules you can.  Nothing prevents that.  But ioctl providers that want better performance can elect for a more straightforward validation model.

> >>  drivers/infiniband/core/Makefile      |   3 +-
> >>  drivers/infiniband/core/device.c      |   1 +
> >>  drivers/infiniband/core/rdma_core.c   | 489
> >> ++++++++++++++++++++++++++++++++++
> >>  drivers/infiniband/core/rdma_core.h   |  75 ++++++
> >>  drivers/infiniband/core/uverbs.h      |   1 +
> >>  drivers/infiniband/core/uverbs_main.c |   2 +-
> >>  include/rdma/ib_verbs.h               |  28 +-
> >>  include/rdma/uverbs_ioctl.h           | 195 ++++++++++++++
> >>  8 files changed, 789 insertions(+), 5 deletions(-)
> >>  create mode 100644 drivers/infiniband/core/rdma_core.c
> >>  create mode 100644 drivers/infiniband/core/rdma_core.h
> >>  create mode 100644 include/rdma/uverbs_ioctl.h
> >>
> >> diff --git a/drivers/infiniband/core/Makefile
> >> b/drivers/infiniband/core/Makefile
> >> index edaae9f..1819623 100644
> >> --- a/drivers/infiniband/core/Makefile
> >> +++ b/drivers/infiniband/core/Makefile
> >> @@ -28,4 +28,5 @@ ib_umad-y :=                        user_mad.o
> >>
> >>  ib_ucm-y :=                  ucm.o
> >>
> >> -ib_uverbs-y :=                       uverbs_main.o uverbs_cmd.o
> >> uverbs_marshall.o
> >> +ib_uverbs-y :=                       uverbs_main.o uverbs_cmd.o
> >> uverbs_marshall.o \
> >> +                             rdma_core.o
> >> diff --git a/drivers/infiniband/core/device.c
> >> b/drivers/infiniband/core/device.c
> >> index c3b68f5..43994b1 100644
> >> --- a/drivers/infiniband/core/device.c
> >> +++ b/drivers/infiniband/core/device.c
> >> @@ -243,6 +243,7 @@ struct ib_device *ib_alloc_device(size_t size)
> >>       spin_lock_init(&device->client_data_lock);
> >>       INIT_LIST_HEAD(&device->client_data_list);
> >>       INIT_LIST_HEAD(&device->port_list);
> >> +     INIT_LIST_HEAD(&device->type_list);
> >>
> >>       return device;
> >>  }
> >> diff --git a/drivers/infiniband/core/rdma_core.c
> >> b/drivers/infiniband/core/rdma_core.c
> >> new file mode 100644
> >> index 0000000..337abc2
> >> --- /dev/null
> >> +++ b/drivers/infiniband/core/rdma_core.c
> >> @@ -0,0 +1,489 @@
> >> +/*
> >> + * Copyright (c) 2016, Mellanox Technologies inc.  All rights
> >> reserved.
> >> + *
> >> + * This software is available to you under a choice of one of two
> >> + * licenses.  You may choose to be licensed under the terms of the
> GNU
> >> + * General Public License (GPL) Version 2, available from the file
> >> + * COPYING in the main directory of this source tree, or the
> >> + * OpenIB.org BSD license below:
> >> + *
> >> + *     Redistribution and use in source and binary forms, with or
> >> + *     without modification, are permitted provided that the
> following
> >> + *     conditions are met:
> >> + *
> >> + *      - Redistributions of source code must retain the above
> >> + *        copyright notice, this list of conditions and the
> following
> >> + *        disclaimer.
> >> + *
> >> + *      - Redistributions in binary form must reproduce the above
> >> + *        copyright notice, this list of conditions and the
> following
> >> + *        disclaimer in the documentation and/or other materials
> >> + *        provided with the distribution.
> >> + *
> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> >> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
> OF
> >> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> >> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
> HOLDERS
> >> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
> AN
> >> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
> IN
> >> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> >> + * SOFTWARE.
> >> + */
> >> +
> >> +#include <linux/file.h>
> >> +#include <linux/anon_inodes.h>
> >> +#include <rdma/ib_verbs.h>
> >> +#include "uverbs.h"
> >> +#include "rdma_core.h"
> >> +#include <rdma/uverbs_ioctl.h>
> >> +
> >> +const struct uverbs_type *uverbs_get_type(const struct ib_device
> >> *ibdev,
> >> +                                       uint16_t type)
> >> +{
> >> +     const struct uverbs_types_group *groups = ibdev->types_group;
> >> +     const struct uverbs_types *types;
> >> +     int ret = groups->dist(&type, groups->priv);
> >> +
> >> +     if (ret >= groups->num_groups)
> >> +             return NULL;
> >> +
> >> +     types = groups->type_groups[ret];
> >> +
> >> +     if (type >= types->num_types)
> >> +             return NULL;
> >> +
> >> +     return types->types[type];
> >> +}
> >> +
> >> +static int uverbs_lock_object(struct ib_uobject *uobj,
> >> +                           enum uverbs_idr_access access)
> >> +{
> >> +     if (access == UVERBS_IDR_ACCESS_READ)
> >> +             return down_read_trylock(&uobj->usecnt) == 1 ? 0 : -
> EBUSY;
> >> +
> >> +     /* lock is either WRITE or DESTROY - should be exclusive */
> >> +     return down_write_trylock(&uobj->usecnt) == 1 ? 0 : -EBUSY;
> >
> > This function could take the lock type directly (read or write),
> versus inferring it based on some other access type.
> >
> 
> We can, but since we use these enums in the attribute specifications,
> I thought it could be more convinient.
> 
> >> +}
> >> +
> >> +static struct ib_uobject *get_uobj(int id, struct ib_ucontext
> >> *context)
> >> +{
> >> +     struct ib_uobject *uobj;
> >> +
> >> +     rcu_read_lock();
> >> +     uobj = idr_find(&context->device->idr, id);
> >> +     if (uobj && uobj->live) {
> >> +             if (uobj->context != context)
> >> +                     uobj = NULL;
> >> +     }
> >> +     rcu_read_unlock();
> >> +
> >> +     return uobj;
> >> +}
> >> +
> >> +struct ib_ucontext_lock {
> >> +     struct kref  ref;
> >> +     /* locking the uobjects_list */
> >> +     struct mutex lock;
> >> +};
> >> +
> >> +static void init_uobjects_list_lock(struct ib_ucontext_lock *lock)
> >> +{
> >> +     mutex_init(&lock->lock);
> >> +     kref_init(&lock->ref);
> >> +}
> >> +
> >> +static void release_uobjects_list_lock(struct kref *ref)
> >> +{
> >> +     struct ib_ucontext_lock *lock = container_of(ref,
> >> +                                                  struct
> ib_ucontext_lock,
> >> +                                                  ref);
> >> +
> >> +     kfree(lock);
> >> +}
> >> +
> >> +static void init_uobj(struct ib_uobject *uobj, u64 user_handle,
> >> +                   struct ib_ucontext *context)
> >> +{
> >> +     init_rwsem(&uobj->usecnt);
> >> +     uobj->user_handle = user_handle;
> >> +     uobj->context     = context;
> >> +     uobj->live        = 0;
> >> +}
> >> +
> >> +static int add_uobj(struct ib_uobject *uobj)
> >> +{
> >> +     int ret;
> >> +
> >> +     idr_preload(GFP_KERNEL);
> >> +     spin_lock(&uobj->context->device->idr_lock);
> >> +
> >> +     ret = idr_alloc(&uobj->context->device->idr, uobj, 0, 0,
> >> GFP_NOWAIT);
> >> +     if (ret >= 0)
> >> +             uobj->id = ret;
> >> +
> >> +     spin_unlock(&uobj->context->device->idr_lock);
> >> +     idr_preload_end();
> >> +
> >> +     return ret < 0 ? ret : 0;
> >> +}
> >> +
> >> +static void remove_uobj(struct ib_uobject *uobj)
> >> +{
> >> +     spin_lock(&uobj->context->device->idr_lock);
> >> +     idr_remove(&uobj->context->device->idr, uobj->id);
> >> +     spin_unlock(&uobj->context->device->idr_lock);
> >> +}
> >> +
> >> +static void put_uobj(struct ib_uobject *uobj)
> >> +{
> >> +     kfree_rcu(uobj, rcu);
> >> +}
> >> +
> >> +static struct ib_uobject *get_uobject_from_context(struct
> ib_ucontext
> >> *ucontext,
> >> +                                                const struct
> >> uverbs_type_alloc_action *type,
> >> +                                                u32 idr,
> >> +                                                enum
> uverbs_idr_access access)
> >> +{
> >> +     struct ib_uobject *uobj;
> >> +     int ret;
> >> +
> >> +     rcu_read_lock();
> >> +     uobj = get_uobj(idr, ucontext);
> >> +     if (!uobj)
> >> +             goto free;
> >> +
> >> +     if (uobj->type != type) {
> >> +             uobj = NULL;
> >> +             goto free;
> >> +     }
> >> +
> >> +     ret = uverbs_lock_object(uobj, access);
> >> +     if (ret)
> >> +             uobj = ERR_PTR(ret);
> >> +free:
> >> +     rcu_read_unlock();
> >> +     return uobj;
> >> +
> >> +     return NULL;
> >> +}
> >> +
> >> +static int ib_uverbs_uobject_add(struct ib_uobject *uobject,
> >> +                              const struct uverbs_type_alloc_action
> >> *uobject_type)
> >> +{
> >> +     uobject->type = uobject_type;
> >> +     return add_uobj(uobject);
> >> +}
> >> +
> >> +struct ib_uobject *uverbs_get_type_from_idr(const struct
> >> uverbs_type_alloc_action *type,
> >> +                                         struct ib_ucontext
> *ucontext,
> >> +                                         enum uverbs_idr_access
> access,
> >> +                                         uint32_t idr)
> >> +{
> >> +     struct ib_uobject *uobj;
> >> +     int ret;
> >> +
> >> +     if (access == UVERBS_IDR_ACCESS_NEW) {
> >> +             uobj = kmalloc(type->obj_size, GFP_KERNEL);
> >> +             if (!uobj)
> >> +                     return ERR_PTR(-ENOMEM);
> >> +
> >> +             init_uobj(uobj, 0, ucontext);
> >> +
> >> +             /* lock idr */
> >
> > Command to lock idr, but no lock is obtained.
> >
> 
> ib_uverbs_uobject_add calls add_uobj which locks the IDR.
> 
> >> +             ret = ib_uverbs_uobject_add(uobj, type);
> >> +             if (ret) {
> >> +                     kfree(uobj);
> >> +                     return ERR_PTR(ret);
> >> +             }
> >> +
> >> +     } else {
> >> +             uobj = get_uobject_from_context(ucontext, type, idr,
> >> +                                             access);
> >> +
> >> +             if (!uobj)
> >> +                     return ERR_PTR(-ENOENT);
> >> +     }
> >> +
> >> +     return uobj;
> >> +}
> >> +
> >> +struct ib_uobject *uverbs_get_type_from_fd(const struct
> >> uverbs_type_alloc_action *type,
> >> +                                        struct ib_ucontext
> *ucontext,
> >> +                                        enum uverbs_idr_access
> access,
> >> +                                        int fd)
> >> +{
> >> +     if (access == UVERBS_IDR_ACCESS_NEW) {
> >> +             int _fd;
> >> +             struct ib_uobject *uobj = NULL;
> >> +             struct file *filp;
> >> +
> >> +             _fd = get_unused_fd_flags(O_CLOEXEC);
> >> +             if (_fd < 0 || WARN_ON(type->obj_size < sizeof(struct
> >> ib_uobject)))
> >> +                     return ERR_PTR(_fd);
> >> +
> >> +             uobj = kmalloc(type->obj_size, GFP_KERNEL);
> >> +             init_uobj(uobj, 0, ucontext);
> >> +
> >> +             if (!uobj)
> >> +                     return ERR_PTR(-ENOMEM);
> >> +
> >> +             filp = anon_inode_getfile(type->fd.name, type-
> >fd.fops,
> >> +                                       uobj + 1, type->fd.flags);
> >> +             if (IS_ERR(filp)) {
> >> +                     put_unused_fd(_fd);
> >> +                     kfree(uobj);
> >> +                     return (void *)filp;
> >> +             }
> >> +
> >> +             uobj->type = type;
> >> +             uobj->id = _fd;
> >> +             uobj->object = filp;
> >> +
> >> +             return uobj;
> >> +     } else if (access == UVERBS_IDR_ACCESS_READ) {
> >> +             struct file *f = fget(fd);
> >> +             struct ib_uobject *uobject;
> >> +
> >> +             if (!f)
> >> +                     return ERR_PTR(-EBADF);
> >> +
> >> +             uobject = f->private_data - sizeof(struct ib_uobject);
> >> +             if (f->f_op != type->fd.fops ||
> >> +                 !uobject->live) {
> >> +                     fput(f);
> >> +                     return ERR_PTR(-EBADF);
> >> +             }
> >> +
> >> +             /*
> >> +              * No need to protect it with a ref count, as fget
> >> increases
> >> +              * f_count.
> >> +              */
> >> +             return uobject;
> >> +     } else {
> >> +             return ERR_PTR(-EOPNOTSUPP);
> >> +     }
> >> +}
> >> +
> >> +static void ib_uverbs_uobject_enable(struct ib_uobject *uobject)
> >> +{
> >> +     mutex_lock(&uobject->context->uobjects_lock->lock);
> >> +     list_add(&uobject->list, &uobject->context->uobjects);
> >> +     mutex_unlock(&uobject->context->uobjects_lock->lock);
> >
> > Why not just insert the object into the list on creation?
> >
> >> +     uobject->live = 1;
> >
> > See my comments above on removing the live field.
> >
> 
> Seems that the list could suffice, but I'll look into that.
> 
> >> +}
> >> +
> >> +static void ib_uverbs_uobject_remove(struct ib_uobject *uobject,
> bool
> >> lock)
> >> +{
> >> +     /*
> >> +      * Calling remove requires exclusive access, so it's not
> possible
> >> +      * another thread will use our object.
> >> +      */
> >> +     uobject->live = 0;
> >> +     uobject->type->free_fn(uobject->type, uobject);
> >> +     if (lock)
> >> +             mutex_lock(&uobject->context->uobjects_lock->lock);
> >> +     list_del(&uobject->list);
> >> +     if (lock)
> >> +             mutex_unlock(&uobject->context->uobjects_lock->lock);
> >> +     remove_uobj(uobject);
> >> +     put_uobj(uobject);
> >> +}
> >> +
> >> +static void uverbs_unlock_idr(struct ib_uobject *uobj,
> >> +                           enum uverbs_idr_access access,
> >> +                           bool success)
> >> +{
> >> +     switch (access) {
> >> +     case UVERBS_IDR_ACCESS_READ:
> >> +             up_read(&uobj->usecnt);
> >> +             break;
> >> +     case UVERBS_IDR_ACCESS_NEW:
> >> +             if (success) {
> >> +                     ib_uverbs_uobject_enable(uobj);
> >> +             } else {
> >> +                     remove_uobj(uobj);
> >> +                     put_uobj(uobj);
> >> +             }
> >> +             break;
> >> +     case UVERBS_IDR_ACCESS_WRITE:
> >> +             up_write(&uobj->usecnt);
> >> +             break;
> >> +     case UVERBS_IDR_ACCESS_DESTROY:
> >> +             if (success)
> >> +                     ib_uverbs_uobject_remove(uobj, true);
> >> +             else
> >> +                     up_write(&uobj->usecnt);
> >> +             break;
> >> +     }
> >> +}
> >> +
> >> +static void uverbs_unlock_fd(struct ib_uobject *uobj,
> >> +                          enum uverbs_idr_access access,
> >> +                          bool success)
> >> +{
> >> +     struct file *filp = uobj->object;
> >> +
> >> +     if (access == UVERBS_IDR_ACCESS_NEW) {
> >> +             if (success) {
> >> +                     kref_get(&uobj->context->ufile->ref);
> >> +                     uobj->uobjects_lock = uobj->context-
> >uobjects_lock;
> >> +                     kref_get(&uobj->uobjects_lock->ref);
> >> +                     ib_uverbs_uobject_enable(uobj);
> >> +                     fd_install(uobj->id, uobj->object);
> >
> > I don't get this.  The function is unlocking something, but there are
> calls to get krefs?
> >
> 
> Before invoking the user's callback, we're first locking all objects
> and afterwards we're unlocking them. When we need to create a new
> object, the lock becomes object creation and the unlock could become
> (assuming the user's callback succeeded) enabling this new object.
> When you add a new object (or fd in this case), we take a reference
> count to both the uverbs_file and the locking context.
> 
> >> +             } else {
> >> +                     fput(uobj->object);
> >> +                     put_unused_fd(uobj->id);
> >> +                     kfree(uobj);
> >> +             }
> >> +     } else {
> >> +             fput(filp);
> >> +     }
> >> +}
> >> +
> >> +void uverbs_unlock_object(struct ib_uobject *uobj,
> >> +                       enum uverbs_idr_access access,
> >> +                       bool success)
> >> +{
> >> +     if (uobj->type->type == UVERBS_ATTR_TYPE_IDR)
> >> +             uverbs_unlock_idr(uobj, access, success);
> >> +     else if (uobj->type->type == UVERBS_ATTR_TYPE_FD)
> >> +             uverbs_unlock_fd(uobj, access, success);
> >> +     else
> >> +             WARN_ON(true);
> >> +}
> >> +
> >> +static void ib_uverbs_remove_fd(struct ib_uobject *uobject)
> >> +{
> >> +     /*
> >> +      * user should release the uobject in the release
> >> +      * callback.
> >> +      */
> >> +     if (uobject->live) {
> >> +             uobject->live = 0;
> >> +             list_del(&uobject->list);
> >> +             uobject->type->free_fn(uobject->type, uobject);
> >> +             kref_put(&uobject->context->ufile->ref,
> >> ib_uverbs_release_file);
> >> +             uobject->context = NULL;
> >> +     }
> >> +}
> >> +
> >> +void ib_uverbs_close_fd(struct file *f)
> >> +{
> >> +     struct ib_uobject *uobject = f->private_data - sizeof(struct
> >> ib_uobject);
> >> +
> >> +     mutex_lock(&uobject->uobjects_lock->lock);
> >> +     if (uobject->live) {
> >> +             uobject->live = 0;
> >> +             list_del(&uobject->list);
> >> +             kref_put(&uobject->context->ufile->ref,
> >> ib_uverbs_release_file);
> >> +             uobject->context = NULL;
> >> +     }
> >> +     mutex_unlock(&uobject->uobjects_lock->lock);
> >> +     kref_put(&uobject->uobjects_lock->ref,
> >> release_uobjects_list_lock);
> >> +}
> >> +
> >> +void ib_uverbs_cleanup_fd(void *private_data)
> >> +{
> >> +     struct ib_uboject *uobject = private_data - sizeof(struct
> >> ib_uobject);
> >> +
> >> +     kfree(uobject);
> >> +}
> >> +
> >> +void uverbs_unlock_objects(struct uverbs_attr_array *attr_array,
> >> +                        size_t num,
> >> +                        const struct uverbs_action_spec *spec,
> >> +                        bool success)
> >> +{
> >> +     unsigned int i;
> >> +
> >> +     for (i = 0; i < num; i++) {
> >> +             struct uverbs_attr_array *attr_spec_array =
> &attr_array[i];
> >> +             const struct uverbs_attr_group_spec *group_spec =
> >> +                     spec->attr_groups[i];
> >> +             unsigned int j;
> >> +
> >> +             for (j = 0; j < attr_spec_array->num_attrs; j++) {
> >> +                     struct uverbs_attr *attr = &attr_spec_array-
> >> >attrs[j];
> >> +                     struct uverbs_attr_spec *spec = &group_spec-
> >> >attrs[j];
> >> +
> >> +                     if (!attr->valid)
> >> +                             continue;
> >> +
> >> +                     if (spec->type == UVERBS_ATTR_TYPE_IDR ||
> >> +                         spec->type == UVERBS_ATTR_TYPE_FD)
> >> +                             /*
> >> +                              * refcounts should be handled at the
> object
> >> +                              * level and not at the uobject level.
> >> +                              */
> >> +                             uverbs_unlock_object(attr-
> >obj_attr.uobject,
> >> +                                                  spec->obj.access,
> success);
> >> +             }
> >> +     }
> >> +}
> >> +
> >> +static unsigned int get_type_orders(const struct uverbs_types_group
> >> *types_group)
> >> +{
> >> +     unsigned int i;
> >> +     unsigned int max = 0;
> >> +
> >> +     for (i = 0; i < types_group->num_groups; i++) {
> >> +             unsigned int j;
> >> +             const struct uverbs_types *types = types_group-
> >> >type_groups[i];
> >> +
> >> +             for (j = 0; j < types->num_types; j++) {
> >> +                     if (!types->types[j] || !types->types[j]-
> >alloc)
> >> +                             continue;
> >> +                     if (types->types[j]->alloc->order > max)
> >> +                             max = types->types[j]->alloc->order;
> >> +             }
> >> +     }
> >> +
> >> +     return max;
> >> +}
> >> +
> >> +void ib_uverbs_uobject_type_cleanup_ucontext(struct ib_ucontext
> >> *ucontext,
> >> +                                          const struct
> uverbs_types_group
> >> *types_group)
> >> +{
> >> +     unsigned int num_orders = get_type_orders(types_group);
> >> +     unsigned int i;
> >> +
> >> +     for (i = 0; i <= num_orders; i++) {
> >> +             struct ib_uobject *obj, *next_obj;
> >> +
> >> +             /*
> >> +              * No need to take lock here, as cleanup should be
> called
> >> +              * after all commands finished executing. Newly
> executed
> >> +              * commands should fail.
> >> +              */
> >> +             mutex_lock(&ucontext->uobjects_lock->lock);
> >
> > It's really confusing to see a comment about 'no need to take lock'
> immediately followed by a call to lock.
> >
> 
> Yeah :) That was before adding the fd. I'll delete the comment.
> 
> >> +             list_for_each_entry_safe(obj, next_obj, &ucontext-
> >> >uobjects,
> >> +                                      list)
> >> +                     if (obj->type->order == i) {
> >> +                             if (obj->type->type ==
> UVERBS_ATTR_TYPE_IDR)
> >> +                                     ib_uverbs_uobject_remove(obj,
> false);
> >> +                             else
> >> +                                     ib_uverbs_remove_fd(obj);
> >> +                     }
> >> +             mutex_unlock(&ucontext->uobjects_lock->lock);
> >> +     }
> >> +     kref_put(&ucontext->uobjects_lock->ref,
> >> release_uobjects_list_lock);
> >> +}
> >> +
> >> +int ib_uverbs_uobject_type_initialize_ucontext(struct ib_ucontext
> >> *ucontext)
> >
> > Please work on the function names.  This is horrendously long and
> still doesn't help describe what it does.
> >
> 
> This just initialized the types part of the ucontext. Any suggestions?
> 
> >> +{
> >> +     ucontext->uobjects_lock = kmalloc(sizeof(*ucontext-
> >> >uobjects_lock),
> >> +                                       GFP_KERNEL);
> >> +     if (!ucontext->uobjects_lock)
> >> +             return -ENOMEM;
> >> +
> >> +     init_uobjects_list_lock(ucontext->uobjects_lock);
> >> +     INIT_LIST_HEAD(&ucontext->uobjects);
> >> +
> >> +     return 0;
> >> +}
> >> +
> >> +void ib_uverbs_uobject_type_release_ucontext(struct ib_ucontext
> >> *ucontext)
> >> +{
> >> +     kfree(ucontext->uobjects_lock);
> >> +}
> >
> > No need to wrap a call to 'free'.
> >
> 
> In order to abstract away the ucontext type data structure.
> 
> >> +
> >> diff --git a/drivers/infiniband/core/rdma_core.h
> >> b/drivers/infiniband/core/rdma_core.h
> >> new file mode 100644
> >> index 0000000..8990115
> >> --- /dev/null
> >> +++ b/drivers/infiniband/core/rdma_core.h
> >> @@ -0,0 +1,75 @@
> >> +/*
> >> + * Copyright (c) 2005 Topspin Communications.  All rights reserved.
> >> + * Copyright (c) 2005, 2006 Cisco Systems.  All rights reserved.
> >> + * Copyright (c) 2005-2016 Mellanox Technologies. All rights
> reserved.
> >> + * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
> >> + * Copyright (c) 2005 PathScale, Inc. All rights reserved.
> >> + *
> >> + * This software is available to you under a choice of one of two
> >> + * licenses.  You may choose to be licensed under the terms of the
> GNU
> >> + * General Public License (GPL) Version 2, available from the file
> >> + * COPYING in the main directory of this source tree, or the
> >> + * OpenIB.org BSD license below:
> >> + *
> >> + *     Redistribution and use in source and binary forms, with or
> >> + *     without modification, are permitted provided that the
> following
> >> + *     conditions are met:
> >> + *
> >> + *      - Redistributions of source code must retain the above
> >> + *        copyright notice, this list of conditions and the
> following
> >> + *        disclaimer.
> >> + *
> >> + *      - Redistributions in binary form must reproduce the above
> >> + *        copyright notice, this list of conditions and the
> following
> >> + *        disclaimer in the documentation and/or other materials
> >> + *        provided with the distribution.
> >> + *
> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> >> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
> OF
> >> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> >> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
> HOLDERS
> >> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
> AN
> >> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
> IN
> >> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> >> + * SOFTWARE.
> >> + */
> >> +
> >> +#ifndef UOBJECT_H
> >> +#define UOBJECT_H
> >> +
> >> +#include <linux/idr.h>
> >> +#include <rdma/uverbs_ioctl.h>
> >> +#include <rdma/ib_verbs.h>
> >> +#include <linux/mutex.h>
> >> +
> >> +const struct uverbs_type *uverbs_get_type(const struct ib_device
> >> *ibdev,
> >> +                                       uint16_t type);
> >> +struct ib_uobject *uverbs_get_type_from_idr(const struct
> >> uverbs_type_alloc_action *type,
> >> +                                         struct ib_ucontext
> *ucontext,
> >> +                                         enum uverbs_idr_access
> access,
> >> +                                         uint32_t idr);
> >> +struct ib_uobject *uverbs_get_type_from_fd(const struct
> >> uverbs_type_alloc_action *type,
> >> +                                        struct ib_ucontext
> *ucontext,
> >> +                                        enum uverbs_idr_access
> access,
> >> +                                        int fd);
> >> +void uverbs_unlock_object(struct ib_uobject *uobj,
> >> +                       enum uverbs_idr_access access,
> >> +                       bool success);
> >> +void uverbs_unlock_objects(struct uverbs_attr_array *attr_array,
> >> +                        size_t num,
> >> +                        const struct uverbs_action_spec *spec,
> >> +                        bool success);
> >> +
> >> +void ib_uverbs_uobject_type_cleanup_ucontext(struct ib_ucontext
> >> *ucontext,
> >> +                                          const struct
> uverbs_types_group
> >> *types_group);
> >> +int ib_uverbs_uobject_type_initialize_ucontext(struct ib_ucontext
> >> *ucontext);
> >> +void ib_uverbs_uobject_type_release_ucontext(struct ib_ucontext
> >> *ucontext);
> >> +void ib_uverbs_close_fd(struct file *f);
> >> +void ib_uverbs_cleanup_fd(void *private_data);
> >> +
> >> +static inline void *uverbs_fd_to_priv(struct ib_uobject *uobj)
> >> +{
> >> +     return uobj + 1;
> >> +}
> >
> > This seems like a rather useless function.
> >
> 
> Why? The user sholdn't know or care how we put our structs together.
> 
> >> +
> >> +#endif /* UIDR_H */
> >> diff --git a/drivers/infiniband/core/uverbs.h
> >> b/drivers/infiniband/core/uverbs.h
> >> index 8074705..ae7d4b8 100644
> >> --- a/drivers/infiniband/core/uverbs.h
> >> +++ b/drivers/infiniband/core/uverbs.h
> >> @@ -180,6 +180,7 @@ void idr_remove_uobj(struct ib_uobject *uobj);
> >>  struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file
> >> *uverbs_file,
> >>                                       struct ib_device *ib_dev,
> >>                                       int is_async);
> >> +void ib_uverbs_release_file(struct kref *ref);
> >>  void ib_uverbs_free_async_event_file(struct ib_uverbs_file
> >> *uverbs_file);
> >>  struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd);
> >>
> >> diff --git a/drivers/infiniband/core/uverbs_main.c
> >> b/drivers/infiniband/core/uverbs_main.c
> >> index f783723..e63357a 100644
> >> --- a/drivers/infiniband/core/uverbs_main.c
> >> +++ b/drivers/infiniband/core/uverbs_main.c
> >> @@ -341,7 +341,7 @@ static void ib_uverbs_comp_dev(struct
> >> ib_uverbs_device *dev)
> >>       complete(&dev->comp);
> >>  }
> >>
> >> -static void ib_uverbs_release_file(struct kref *ref)
> >> +void ib_uverbs_release_file(struct kref *ref)
> >>  {
> >>       struct ib_uverbs_file *file =
> >>               container_of(ref, struct ib_uverbs_file, ref);
> >> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> >> index b5d2075..7240615 100644
> >> --- a/include/rdma/ib_verbs.h
> >> +++ b/include/rdma/ib_verbs.h
> >> @@ -1329,8 +1329,11 @@ struct ib_fmr_attr {
> >>
> >>  struct ib_umem;
> >>
> >> +struct ib_ucontext_lock;
> >> +
> >>  struct ib_ucontext {
> >>       struct ib_device       *device;
> >> +     struct ib_uverbs_file  *ufile;
> >>       struct list_head        pd_list;
> >>       struct list_head        mr_list;
> >>       struct list_head        mw_list;
> >> @@ -1344,6 +1347,10 @@ struct ib_ucontext {
> >>       struct list_head        rwq_ind_tbl_list;
> >>       int                     closing;
> >>
> >> +     /* lock for uobjects list */
> >> +     struct ib_ucontext_lock *uobjects_lock;
> >> +     struct list_head        uobjects;
> >> +
> >>       struct pid             *tgid;
> >>  #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
> >>       struct rb_root      umem_tree;
> >> @@ -1363,16 +1370,28 @@ struct ib_ucontext {
> >>  #endif
> >>  };
> >>
> >> +struct uverbs_object_list;
> >> +
> >> +#define OLD_ABI_COMPAT
> >> +
> >>  struct ib_uobject {
> >>       u64                     user_handle;    /* handle given to us
> by userspace
> >> */
> >>       struct ib_ucontext     *context;        /* associated user
> context
> >> */
> >>       void                   *object;         /* containing object
> */
> >>       struct list_head        list;           /* link to context's
> list */
> >> -     int                     id;             /* index into kernel
> idr */
> >> -     struct kref             ref;
> >> -     struct rw_semaphore     mutex;          /* protects .live */
> >> +     int                     id;             /* index into kernel
> idr/fd */
> >> +#ifdef OLD_ABI_COMPAT
> >> +     struct kref             ref;
> >> +#endif
> >> +     struct rw_semaphore     usecnt;         /* protects exclusive
> >> access */
> >> +#ifdef OLD_ABI_COMPAT
> >> +     struct rw_semaphore     mutex;          /* protects .live */
> >> +#endif
> >>       struct rcu_head         rcu;            /* kfree_rcu()
> overhead */
> >>       int                     live;
> >> +
> >> +     const struct uverbs_type_alloc_action *type;
> >> +     struct ib_ucontext_lock *uobjects_lock;
> >>  };
> >>
> >>  struct ib_udata {
> >> @@ -2101,6 +2120,9 @@ struct ib_device {
> >>        */
> >>       int (*get_port_immutable)(struct ib_device *, u8, struct
> >> ib_port_immutable *);
> >>       void (*get_dev_fw_str)(struct ib_device *, char *str, size_t
> >> str_len);
> >> +     struct list_head type_list;
> >> +
> >> +     const struct uverbs_types_group *types_group;
> >>  };
> >>
> >>  struct ib_client {
> >> diff --git a/include/rdma/uverbs_ioctl.h
> b/include/rdma/uverbs_ioctl.h
> >> new file mode 100644
> >> index 0000000..2f50045
> >> --- /dev/null
> >> +++ b/include/rdma/uverbs_ioctl.h
> >> @@ -0,0 +1,195 @@
> >> +/*
> >> + * Copyright (c) 2016, Mellanox Technologies inc.  All rights
> >> reserved.
> >> + *
> >> + * This software is available to you under a choice of one of two
> >> + * licenses.  You may choose to be licensed under the terms of the
> GNU
> >> + * General Public License (GPL) Version 2, available from the file
> >> + * COPYING in the main directory of this source tree, or the
> >> + * OpenIB.org BSD license below:
> >> + *
> >> + *     Redistribution and use in source and binary forms, with or
> >> + *     without modification, are permitted provided that the
> following
> >> + *     conditions are met:
> >> + *
> >> + *      - Redistributions of source code must retain the above
> >> + *        copyright notice, this list of conditions and the
> following
> >> + *        disclaimer.
> >> + *
> >> + *      - Redistributions in binary form must reproduce the above
> >> + *        copyright notice, this list of conditions and the
> following
> >> + *        disclaimer in the documentation and/or other materials
> >> + *        provided with the distribution.
> >> + *
> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> >> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
> OF
> >> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> >> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
> HOLDERS
> >> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
> AN
> >> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
> IN
> >> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> >> + * SOFTWARE.
> >> + */
> >> +
> >> +#ifndef _UVERBS_IOCTL_
> >> +#define _UVERBS_IOCTL_
> >> +
> >> +#include <linux/kernel.h>
> >> +
> >> +struct uverbs_object_type;
> >> +struct ib_ucontext;
> >> +struct ib_uobject;
> >> +struct ib_device;
> >> +struct uverbs_uobject_type;
> >> +
> >> +/*
> >> + * =======================================
> >> + *   Verbs action specifications
> >> + * =======================================
> >> + */
> >
> > I intentionally used urdma (though condensed to 3 letters that I
> don't recall atm), rather than uverbs.  This will need to work with
> non-verbs devices and interfaces -- again, consider how this fits with
> the rdma cm.  Verbs has a very specific meaning, which gets lost if we
> start referring to everything as 'verbs'.  It's bad enough that we're
> stuck with 'drivers/infiniband' and 'rdma', such that 'infiniband' also
> means ethernet and rdma means nothing.
> >
> 
> IMHO - let's agree on the concept of this infrastructure. One we
> decide its scope, we could generalize it (i.e - ioctl_provider and
> ioctl_context) and implement it to rdma-cm as well.
> 
> >> +
> >> +enum uverbs_attr_type {
> >> +     UVERBS_ATTR_TYPE_PTR_IN,
> >> +     UVERBS_ATTR_TYPE_PTR_OUT,
> >> +     UVERBS_ATTR_TYPE_IDR,
> >> +     UVERBS_ATTR_TYPE_FD,
> >> +};
> >> +
> >> +enum uverbs_idr_access {
> >> +     UVERBS_IDR_ACCESS_READ,
> >> +     UVERBS_IDR_ACCESS_WRITE,
> >> +     UVERBS_IDR_ACCESS_NEW,
> >> +     UVERBS_IDR_ACCESS_DESTROY
> >> +};
> >> +
> >> +struct uverbs_attr_spec {
> >> +     u16                             len;
> >> +     enum uverbs_attr_type           type;
> >> +     struct {
> >> +             u16                     obj_type;
> >> +             u8                      access;
> >
> > Is access intended to be an enum uverbs_idr_access value?
> >
> 
> Yeah, worth using this enum. Thanks.
> 
> >> +     } obj;
> >
> > I would remove (flatten) the substructure and re-order the fields for
> better alignment.
> >
> 
> I noticed there are several places which aren't aliged. It's in my todo
> list.
> 
> >> +};
> >> +
> >> +struct uverbs_attr_group_spec {
> >> +     struct uverbs_attr_spec         *attrs;
> >> +     size_t                          num_attrs;
> >> +};
> >> +
> >> +struct uverbs_action_spec {
> >> +     const struct uverbs_attr_group_spec             **attr_groups;
> >> +     /* if > 0 -> validator, otherwise, error */
> >
> > ? not sure what this comment means
> >
> >> +     int (*dist)(__u16 *attr_id, void *priv);
> >
> > What does 'dist' stand for?
> >
> 
> dist = distribution function.
> It maps the attributes you got from the user-space to your groups. You
> could think of each group as a namespace - where its attributes (or
> types/actions) starts from zero in the sake of compactness.
> So, for example, it gets an attribute 0x8010 and maps to to "group 1"
> (provider) and attribute 0x10.
> 
> >> +     void                                            *priv;
> >> +     size_t                                          num_groups;
> >> +};
> >> +
> >> +struct uverbs_attr_array;
> >> +struct ib_uverbs_file;
> >> +
> >> +struct uverbs_action {
> >> +     struct uverbs_action_spec spec;
> >> +     void *priv;
> >> +     int (*handler)(struct ib_device *ib_dev, struct ib_uverbs_file
> >> *ufile,
> >> +                    struct uverbs_attr_array *ctx, size_t num, void
> >> *priv);
> >> +};
> >> +
> >> +struct uverbs_type_alloc_action;
> >> +typedef void (*free_type)(const struct uverbs_type_alloc_action
> >> *uobject_type,
> >> +                       struct ib_uobject *uobject);
> >> +
> >> +struct uverbs_type_alloc_action {
> >> +     enum uverbs_attr_type           type;
> >> +     int                             order;
> >
> > I think this is being used as destroy order, in which case I would
> rename it to clarify the intent.  Though I'd prefer we come up with a
> more efficient destruction mechanism than the repeated nested looping.
> >
> 
> In one of the earlier revisions I used a sorted list, which was
> efficient. I recall that Jason didn't like its complexity and
> re-thinking about that - he's right. Most of your types are "order
> number" 0 anyway. So you'll probably iterate very few objects in the
> next round (in verbs, everything but MRs and PDs).
> 
> >> +     size_t                          obj_size;
> >
> > This can be alloc_fn
> >
> >> +     free_type                       free_fn;
> >> +     struct {
> >> +             const struct file_operations    *fops;
> >> +             const char                      *name;
> >> +             int                             flags;
> >> +     } fd;
> >> +};
> >> +
> >> +struct uverbs_type_actions_group {
> >> +     size_t                                  num_actions;
> >> +     const struct uverbs_action              **actions;
> >> +};
> >> +
> >> +struct uverbs_type {
> >> +     size_t                                  num_groups;
> >> +     const struct uverbs_type_actions_group  **action_groups;
> >> +     const struct uverbs_type_alloc_action   *alloc;
> >> +     int (*dist)(__u16 *action_id, void *priv);
> >> +     void                                    *priv;
> >> +};
> >> +
> >> +struct uverbs_types {
> >> +     size_t                                  num_types;
> >> +     const struct uverbs_type                **types;
> >> +};
> >> +
> >> +struct uverbs_types_group {
> >> +     const struct uverbs_types               **type_groups;
> >> +     size_t                                  num_groups;
> >> +     int (*dist)(__u16 *type_id, void *priv);
> >> +     void                                    *priv;
> >> +};
> >> +
> >> +/* =================================================
> >> + *              Parsing infrastructure
> >> + * =================================================
> >> + */
> >> +
> >> +struct uverbs_ptr_attr {
> >> +     void    * __user ptr;
> >> +     __u16           len;
> >> +};
> >> +
> >> +struct uverbs_fd_attr {
> >> +     int             fd;
> >> +};
> >> +
> >> +struct uverbs_uobj_attr {
> >> +     /*  idr handle */
> >> +     __u32   idr;
> >> +};
> >> +
> >> +struct uverbs_obj_attr {
> >> +     /* pointer to the kernel descriptor -> type, access, etc */
> >> +     const struct uverbs_attr_spec *val;
> >> +     struct ib_uverbs_attr __user    *uattr;
> >> +     const struct uverbs_type_alloc_action   *type;
> >> +     struct ib_uobject               *uobject;
> >> +     union {
> >> +             struct uverbs_fd_attr           fd;
> >> +             struct uverbs_uobj_attr         uobj;
> >> +     };
> >> +};
> >> +
> >> +struct uverbs_attr {
> >> +     bool valid; > >> +     union {
> >> +             struct uverbs_ptr_attr  cmd_attr;
> >> +             struct uverbs_obj_attr  obj_attr;
> >> +     };
> >> +};
> >
> > It's odd to have a union that's part of a structure without some
> field to indicate which union field is accessible.
> >
> 
> You index this array but the attribute id from the user's callback
> funciton. The user should know what's the type of the attribute, as
> [s]he declared the specification.
> 
> >> +
> >> +/* output of one validator */
> >> +struct uverbs_attr_array {
> >> +     size_t num_attrs;
> >> +     /* arrays of attrubytes, index is the id i.e SEND_CQ */
> >> +     struct uverbs_attr *attrs;
> >> +};
> >> +
> >> +/* =================================================
> >> + *              Types infrastructure
> >> + * =================================================
> >> + */
> >> +
> >> +int ib_uverbs_uobject_type_add(struct list_head      *head,
> >> +                            void (*free)(struct uverbs_uobject_type
> *type,
> >> +                                         struct ib_uobject
> *uobject,
> >> +                                         struct ib_ucontext
> *ucontext),
> >> +                            uint16_t obj_type);
> >> +void ib_uverbs_uobject_types_remove(struct ib_device *ib_dev);
> >> +
> >> +#endif
> >> --
> >> 2.7.4

Matan, please re-look at the architecture that I proposed:

https://patchwork.kernel.org/patch/9178991/

including the terminology (and consider using common OOP terms).  The *core* of the ioctl framework is to simply invoke a function dispatch table.  IMO, that's where we should start.  Anything beyond that is extra that we should have a strong reason for including.  (Yes, I think we need more.)  Starting simple and adding necessary functionality should let us get something upstream quicker and re-use more of the existing code.

If we're going to re-create netlink as part of the rdma ioctl interface, then why don't we just use netlink directly?
N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±ÙšŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hš‹»öì\x17/oSc¾™Ú³9˜uÀ¦æå‰È&jw¨®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿïêäz¹Þ–Šàþf£¢·hšˆ§~ˆmš

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox