Linux RDMA and InfiniBand development

* Re: [PATCH mlx5-next 4/5] net/mlx5: Introduce TLS TX offload hardware bits and structures
From: Leon Romanovsky @ 2019-07-04 17:15 UTC
  To: Saeed Mahameed
  Cc: Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Eran Ben Elisha, Tariq Toukan
In-Reply-To: <CALzJLG-em1w+Lgf2UutbG2Lzq8bx3zUqoLGx26H2_EXOuuk+jg@mail.gmail.com>

On Thu, Jul 04, 2019 at 01:06:58PM -0400, Saeed Mahameed wrote:
> On Wed, Jul 3, 2019 at 5:27 AM <leon@kernel.org> wrote:
> >
> > On Wed, Jul 03, 2019 at 07:39:32AM +0000, Saeed Mahameed wrote:
> > > From: Eran Ben Elisha <eranbe@mellanox.com>
> > >
> > > Add TLS offload related IFC structs, layouts and enumerations.
> > >
> > > Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> > > Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > > ---
> > >  include/linux/mlx5/device.h   |  14 +++++
> > >  include/linux/mlx5/mlx5_ifc.h | 104 ++++++++++++++++++++++++++++++++--
> > >  2 files changed, 114 insertions(+), 4 deletions(-)
> >
> > <...>
> >
> > > @@ -2725,7 +2739,8 @@ struct mlx5_ifc_traffic_counter_bits {
> > >
> > >  struct mlx5_ifc_tisc_bits {
> > >       u8         strict_lag_tx_port_affinity[0x1];
> > > -     u8         reserved_at_1[0x3];
> > > +     u8         tls_en[0x1];
> > > +     u8         reserved_at_1[0x2];
> >
> > It should be reserved_at_2.
> >
>
> it should be at_1.

Why? See mlx5_ifc_flow_table_prop_layout_bits, mlx5_ifc_roce_cap_bits, etc.

Thanks

>
> > Thanks
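
For readers unfamiliar with the convention Leon points to: in mlx5_ifc.h each
u8 array element stands for one bit, and reserved fields are named after their
bit offset in hex. The following is a minimal userspace sketch of that idea,
not the real layout. The struct example_tisc_bits and the helper
example_reserved_bit_offset() are hypothetical names introduced only for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t u8;

/*
 * Hypothetical, trimmed-down layout for illustration only. Every array
 * element models one bit, so the byte offset of a member in this struct
 * equals its bit offset in the modelled hardware layout.
 */
struct example_tisc_bits {
	u8 strict_lag_tx_port_affinity[0x1];	/* bit 0 */
	u8 tls_en[0x1];				/* bit 1 */
	u8 reserved_at_2[0x2];			/* bits 2..3 */
};

/* Bit offset of the reserved field, derived from the struct itself. */
static size_t example_reserved_bit_offset(void)
{
	size_t off = offsetof(struct example_tisc_bits, reserved_at_2);

	/* The suffix of the name must equal the offset: reserved_at_2 -> 0x2. */
	assert(off == 0x2);
	return off;
}
```

Under this convention, inserting a one-bit field ahead of a reserved field
moves the reserved field by one bit, so its name is expected to change too.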


* Re: [PATCH mlx5-next 0/5] Mellanox, mlx5 low level updates 2019-07-02
From: Saeed Mahameed @ 2019-07-04 17:10 UTC
  To: Leon Romanovsky; +Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org
In-Reply-To: <20190703073909.14965-1-saeedm@mellanox.com>

On Wed, 2019-07-03 at 07:39 +0000, Saeed Mahameed wrote:
> Hi All,
> 
> This series includes some low level updates to mlx5 driver, required
> for
> shared mlx5-next branch.
> 
> Tariq extends the WQE control fields names.
> Eran adds the required HW definitions and structures for upcoming TLS
> support.
> Parav improves and refactors the E-Switch "function changed" handler.
> 
> In case of no objections these patches will be applied to mlx5-next
> and
> will be sent later as pull request to both rdma-next and net-next
> trees.
> 
> Thanks,
> Saeed.

Applied to mlx5-next.



* Re: [PATCH rdma-next] RDMA/mlx5: Use proper allocation API to get zeroed memory
From: Jason Gunthorpe @ 2019-07-04 17:09 UTC
  To: Leon Romanovsky; +Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list
In-Reply-To: <20190630154832.21388-1-leon@kernel.org>

On Sun, Jun 30, 2019 at 06:48:32PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> There is no need for custom memory zeroing, because it can be done
> by using kzalloc from the beginning.
> 
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/infiniband/hw/mlx5/main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Applied to for-next, thanks

Jason
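
The diff for this patch is not quoted here, so the following is only a
userspace analogy of the idea in the commit message, assuming the usual
relationship between an allocate-then-zero sequence and a zeroing allocator.
calloc() stands in for kzalloc(), and struct example_ctx and all example_*
helpers are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context structure, standing in for the driver's object. */
struct example_ctx {
	int id;
	char name[16];
};

/* Two-step version: allocate, then zero by hand. */
static struct example_ctx *example_alloc_then_zero(void)
{
	struct example_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	memset(ctx, 0, sizeof(*ctx));
	return ctx;
}

/* One-step version: the allocator hands back zeroed memory. */
static struct example_ctx *example_alloc_zeroed(void)
{
	return calloc(1, sizeof(struct example_ctx));
}

/* Returns 1 when every byte of the object is zero. */
static int example_ctx_is_zeroed(const struct example_ctx *ctx)
{
	static const struct example_ctx zero;

	assert(ctx != NULL);
	return memcmp(ctx, &zero, sizeof(zero)) == 0;
}
```

Both helpers return equivalent objects; the second simply has one fewer step
to get wrong.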


* Re: [PATCH] RDMA/uverbs: remove redundant assignment to variable ret
From: Jason Gunthorpe @ 2019-07-04 17:07 UTC
  To: Colin King; +Cc: Doug Ledford, linux-rdma, kernel-janitors, linux-kernel
In-Reply-To: <20190704125027.4514-1-colin.king@canonical.com>

On Thu, Jul 04, 2019 at 01:50:27PM +0100, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> The variable ret is being initialized with a value that is never
> read and it is being updated later with a new value. The
> initialization is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/infiniband/core/uverbs_cmd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied to for-next, thanks

Jason


* Re: [PATCH mlx5-next 4/5] net/mlx5: Introduce TLS TX offload hardware bits and structures
From: Saeed Mahameed @ 2019-07-04 17:06 UTC
  To: Leon Romanovsky
  Cc: Saeed Mahameed, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Eran Ben Elisha, Tariq Toukan
In-Reply-To: <20190703092735.GZ4727@mtr-leonro.mtl.com>

On Wed, Jul 3, 2019 at 5:27 AM <leon@kernel.org> wrote:
>
> On Wed, Jul 03, 2019 at 07:39:32AM +0000, Saeed Mahameed wrote:
> > From: Eran Ben Elisha <eranbe@mellanox.com>
> >
> > Add TLS offload related IFC structs, layouts and enumerations.
> >
> > Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> > Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  include/linux/mlx5/device.h   |  14 +++++
> >  include/linux/mlx5/mlx5_ifc.h | 104 ++++++++++++++++++++++++++++++++--
> >  2 files changed, 114 insertions(+), 4 deletions(-)
>
> <...>
>
> > @@ -2725,7 +2739,8 @@ struct mlx5_ifc_traffic_counter_bits {
> >
> >  struct mlx5_ifc_tisc_bits {
> >       u8         strict_lag_tx_port_affinity[0x1];
> > -     u8         reserved_at_1[0x3];
> > +     u8         tls_en[0x1];
> > +     u8         reserved_at_1[0x2];
>
> It should be reserved_at_2.
>

it should be at_1.

> Thanks


* Re: [PATCH for-next] RDMA/hns: Bugfix for hns Makefile
From: Jason Gunthorpe @ 2019-07-04 17:06 UTC
  To: Lijun Ou; +Cc: dledford, leon, linux-rdma, linuxarm
In-Reply-To: <1562221378-73312-1-git-send-email-oulijun@huawei.com>

On Thu, Jul 04, 2019 at 02:22:58PM +0800, Lijun Ou wrote:
> There is a bug in the hns Makefile that leads to a build error
> when allmodconfig is used to build the hns driver.
> 
> The build log is as follows:
> After merging the rdma tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_ah.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_alloc.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_cmd.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_cq.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_db.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_hem.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_mr.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_pd.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_qp.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_restrack.o
> see include/linux/module.h for more information
> WARNING: modpost: missing MODULE_LICENSE() in drivers/infiniband/hw/hns/hns_roce_srq.o
> see include/linux/module.h for more information
> ERROR: "hns_roce_bitmap_cleanup" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
> ERROR: "hns_roce_bitmap_init" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
> ERROR: "hns_roce_free_cmd_mailbox" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
> ERROR: "hns_roce_alloc_cmd_mailbox" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
> ERROR: "hns_roce_table_get" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
> ERROR: "hns_roce_bitmap_alloc" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
> ERROR: "hns_roce_table_find" [drivers/infiniband/hw/hns/hns_roce_srq.ko] undefined!
> 
> Fixes: e9816ddf2a33 ("RDMA/hns: Cleanup unnecessary exported symbols")
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: Xi Wang <wangxi11@huawei.com>
> Signed-off-by: Lijun Ou <oulijun@huawei.com>
> ---
>  drivers/infiniband/hw/hns/Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Applied to for-next, thanks

Jason


* Re: [net-next 1/3] ice: Initialize and register platform device to provide RDMA
From: Jason Gunthorpe @ 2019-07-04 13:53 UTC
  To: Greg KH
  Cc: Jeff Kirsher, davem@davemloft.net, dledford@redhat.com,
	Tony Nguyen, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, poswald@suse.com,
	mustafa.ismail@intel.com, shiraz.saleem@intel.com, Dave Ertman,
	Andrew Bowers
In-Reply-To: <20190704134612.GB10963@kroah.com>

On Thu, Jul 04, 2019 at 03:46:12PM +0200, Greg KH wrote:
> On Thu, Jul 04, 2019 at 12:48:29PM +0000, Jason Gunthorpe wrote:
> > On Thu, Jul 04, 2019 at 02:42:47PM +0200, Greg KH wrote:
> > > On Thu, Jul 04, 2019 at 12:37:33PM +0000, Jason Gunthorpe wrote:
> > > > On Thu, Jul 04, 2019 at 02:29:50PM +0200, Greg KH wrote:
> > > > > On Thu, Jul 04, 2019 at 12:16:41PM +0000, Jason Gunthorpe wrote:
> > > > > > On Wed, Jul 03, 2019 at 07:12:50PM -0700, Jeff Kirsher wrote:
> > > > > > > From: Tony Nguyen <anthony.l.nguyen@intel.com>
> > > > > > > 
> > > > > > > The RDMA block does not advertise on the PCI bus or any other bus.
> > > > > > > Thus the ice driver needs to provide access to the RDMA hardware block
> > > > > > > via a virtual bus; utilize the platform bus to provide this access.
> > > > > > > 
> > > > > > > This patch initializes the driver to support RDMA as well as creates
> > > > > > > and registers a platform device for the RDMA driver to register to. At
> > > > > > > this point the driver is fully initialized to register a platform
> > > > > > > driver, however, can not yet register as the ops have not been
> > > > > > > implemented.
> > > > > > 
> > > > > > I think you need Greg's ack on all this driver stuff - particularly
> > > > > > that a platform_device is OK.
> > > > > 
> > > > > A platform_device is almost NEVER ok.
> > > > > 
> > > > > Don't abuse it, make a real device on a real bus.  If you don't have a
> > > > > real bus and just need to create a device to hang other things off of,
> > > > > then use the virtual one, that's what it is there for.
> > > > 
> > > > Ideally I'd like to see all the RDMA drivers that connect to ethernet
> > > > drivers use some similar scheme.
> > > 
> > > Why?  They should be attached to a "real" device, why make any up?
> > 
> > ? A "real" device, like struct pci_device, can only bind to one
> > driver. How can we bind it concurrently to net, rdma, scsi, etc?
> 
> MFD was designed for this very problem.
> 
> > > > This is for a PCI device that plugs into multiple subsystems in the
> > > > kernel, ie it has net driver functionality, rdma functionality, some
> > > > even have SCSI functionality
> > > 
> > > Sounds like a MFD device, why aren't you using that functionality
> > > instead?
> > 
> > This was also my advice, but in another email Jeff says:
> > 
> >   MFD architecture was also considered, and we selected the simpler
> >   platform model. Supporting a MFD architecture would require an
> >   additional MFD core driver, individual platform netdev, RDMA function
> >   drivers, and stripping a large portion of the netdev drivers into
> >   MFD core. The sub-devices registered by MFD core for function
> >   drivers are indeed platform devices.  
> 
> So, "mfd is too hard, let's abuse a platform device" is ok?
> 
> People have been wanting to do MFD drivers for PCI devices for a long
> time, it's about time someone actually did the work for it, I bet it
> will not be all that complex if tiny embedded drivers can do it :)

Okay, sounds like a NAK to me. I'll drop these patches from the RDMA
patchworks and Jeff can work through the MFD stuff first.

Jason
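
The point argued in this thread, that one physical device can expose several
functions which each bind to their own driver, can be pictured with a small
userspace model. This is a toy sketch only: it does not use or imitate the
kernel's MFD API, and struct example_cell, struct example_parent and
example_bind() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical function cell: one sub-device exposed by the parent. */
struct example_cell {
	const char *name;	/* which function this cell represents */
	const char *driver;	/* driver bound to the cell, or NULL */
};

/* Hypothetical parent: one physical device exposing several functions. */
struct example_parent {
	struct example_cell cells[3];
	size_t num_cells;
};

/*
 * Bind a driver to the named cell. Each cell accepts at most one driver,
 * but the cells of one parent bind independently of each other.
 * Returns 0 on success, -1 if the cell is unknown or already bound.
 */
static int example_bind(struct example_parent *p, const char *cell,
			const char *driver)
{
	size_t i;

	assert(p != NULL && cell != NULL);
	for (i = 0; i < p->num_cells; i++) {
		if (strcmp(p->cells[i].name, cell) != 0)
			continue;
		if (p->cells[i].driver)
			return -1;
		p->cells[i].driver = driver;
		return 0;
	}
	return -1;
}
```

In this model the "one driver per device" rule still holds for every cell,
while the parent as a whole serves net, rdma and scsi at the same time.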


* Re: [net-next 1/3] ice: Initialize and register platform device to provide RDMA
From: Greg KH @ 2019-07-04 13:46 UTC
  To: Jason Gunthorpe
  Cc: Jeff Kirsher, davem@davemloft.net, dledford@redhat.com,
	Tony Nguyen, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, poswald@suse.com,
	mustafa.ismail@intel.com, shiraz.saleem@intel.com, Dave Ertman,
	Andrew Bowers
In-Reply-To: <20190704124824.GK3401@mellanox.com>

On Thu, Jul 04, 2019 at 12:48:29PM +0000, Jason Gunthorpe wrote:
> On Thu, Jul 04, 2019 at 02:42:47PM +0200, Greg KH wrote:
> > On Thu, Jul 04, 2019 at 12:37:33PM +0000, Jason Gunthorpe wrote:
> > > On Thu, Jul 04, 2019 at 02:29:50PM +0200, Greg KH wrote:
> > > > On Thu, Jul 04, 2019 at 12:16:41PM +0000, Jason Gunthorpe wrote:
> > > > > On Wed, Jul 03, 2019 at 07:12:50PM -0700, Jeff Kirsher wrote:
> > > > > > From: Tony Nguyen <anthony.l.nguyen@intel.com>
> > > > > > 
> > > > > > The RDMA block does not advertise on the PCI bus or any other bus.
> > > > > > Thus the ice driver needs to provide access to the RDMA hardware block
> > > > > > via a virtual bus; utilize the platform bus to provide this access.
> > > > > > 
> > > > > > This patch initializes the driver to support RDMA as well as creates
> > > > > > and registers a platform device for the RDMA driver to register to. At
> > > > > > this point the driver is fully initialized to register a platform
> > > > > > driver, however, can not yet register as the ops have not been
> > > > > > implemented.
> > > > > 
> > > > > I think you need Greg's ack on all this driver stuff - particularly
> > > > > that a platform_device is OK.
> > > > 
> > > > A platform_device is almost NEVER ok.
> > > > 
> > > > Don't abuse it, make a real device on a real bus.  If you don't have a
> > > > real bus and just need to create a device to hang other things off of,
> > > > then use the virtual one, that's what it is there for.
> > > 
> > > Ideally I'd like to see all the RDMA drivers that connect to ethernet
> > > drivers use some similar scheme.
> > 
> > Why?  They should be attached to a "real" device, why make any up?
> 
> ? A "real" device, like struct pci_device, can only bind to one
> driver. How can we bind it concurrently to net, rdma, scsi, etc?

MFD was designed for this very problem.

> > > This is for a PCI device that plugs into multiple subsystems in the
> > > kernel, ie it has net driver functionality, rdma functionality, some
> > > even have SCSI functionality
> > 
> > Sounds like a MFD device, why aren't you using that functionality
> > instead?
> 
> This was also my advice, but in another email Jeff says:
> 
>   MFD architecture was also considered, and we selected the simpler
>   platform model. Supporting a MFD architecture would require an
>   additional MFD core driver, individual platform netdev, RDMA function
>   drivers, and stripping a large portion of the netdev drivers into
>   MFD core. The sub-devices registered by MFD core for function
>   drivers are indeed platform devices.  

So, "mfd is too hard, let's abuse a platform device" is ok?

People have been wanting to do MFD drivers for PCI devices for a long
time, it's about time someone actually did the work for it, I bet it
will not be all that complex if tiny embedded drivers can do it :)

thanks,

greg k-h


* Re: [RFC rdma-core] verbs: add ibv_export_to_fd man page
From: Yuval Shaia @ 2019-07-04 13:41 UTC
  To: Jason Gunthorpe; +Cc: Shamir Rabinovitch, linux-rdma, leon, Santosh Shilimkar
In-Reply-To: <20190702224807.GE11860@ziepe.ca>

On Tue, Jul 02, 2019 at 07:48:07PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 26, 2019 at 03:46:39PM +0300, Yuval Shaia wrote:
> > On Wed, Jun 26, 2019 at 11:36:14AM +0300, Shamir Rabinovitch wrote:
> > > Add the ibv_export_to_fd man page.
> > 
> > > This is an RFC, but I still suggest adding a few words here.
> > > 
> > > Also, the subject is incorrect since the man page covers all functions
> > > involved in the shared-obj mechanism, not only export_to_fd.
> > 
> > > 
> > > Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
> > >  libibverbs/man/ibv_export_to_fd.3.md | 109 +++++++++++++++++++++++++++
> > >  1 file changed, 109 insertions(+)
> > >  create mode 100644 libibverbs/man/ibv_export_to_fd.3.md
> > > 
> > > diff --git a/libibverbs/man/ibv_export_to_fd.3.md b/libibverbs/man/ibv_export_to_fd.3.md
> > > new file mode 100644
> > > index 00000000..8e3f0fb2
> > > +++ b/libibverbs/man/ibv_export_to_fd.3.md
> > > @@ -0,0 +1,109 @@
> > > +---
> > > +date: 2018-06-26
> > > +footer: libibverbs
> > > +header: "Libibverbs Programmer's Manual"
> > > +layout: page
> > > +license: 'Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md'
> > > +section: 3
> > > +title: ibv_export_to_fd
> > > +tagline: Verbs
> > > +---
> > > +
> > > +# NAME
> > > +
> > > +**ibv_export_to_fd**, **ibv_import_pd**, **ibv_import_mr** - export & import ib hw objects.
> > > +
> > > +# SYNOPSIS
> > > +
> > > +```c
> > > +#include <infiniband/verbs.h>
> > > +
> > > +int ibv_export_to_fd(uint32_t fd,
> > > +                     uint32_t *new_handle,
> > > +                     struct ibv_context *context,
> > > +                     enum uverbs_default_objects type,
> > > +                     uint32_t handle);
> 
> This should probably be some internal function and the exports should
> be type safe just like the imports.

So you are suggesting something like this (taking the object instead of
passing the handle as an argument):

int ibv_export_pd(uint32_t fd,
		  uint32_t *new_handle,
		  struct ibv_context *context,
		  struct ibv_pd *pd);

int ibv_export_mr(uint32_t fd,
		  uint32_t *new_handle,
		  struct ibv_context *context,
		  struct ibv_mr *mr);

That way the handle is taken internally from the pd or mr argument.

Are you still OK with new_handle? I am asking because this is what is
used in the ibv_import_xxx functions.
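
The shape being discussed, typed wrappers in front of one untyped internal
helper, can be sketched in plain C. This is a self-contained illustration
under stated assumptions, not the proposed libibverbs code: all example_*
names are hypothetical, the context argument is left out for brevity, and the
internal helper is a stub that only echoes the handle instead of talking to
the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object kinds; not the real uverbs enumeration. */
enum example_obj_type { EXAMPLE_OBJ_PD, EXAMPLE_OBJ_MR };

/* Hypothetical stand-ins for the PD and MR objects. */
struct example_pd { uint32_t handle; };
struct example_mr { uint32_t handle; };

/*
 * Internal, untyped helper. A real implementation would issue a command
 * to the kernel; this stub only copies the handle so that the wrappers
 * can be exercised.
 */
static int example_export_to_fd(uint32_t fd, uint32_t *new_handle,
				enum example_obj_type type, uint32_t handle)
{
	(void)fd;
	(void)type;
	if (!new_handle)
		return -1;
	*new_handle = handle;
	return 0;
}

/* Typed wrappers: the handle comes from the object, not from the caller. */
static int example_export_pd(uint32_t fd, uint32_t *new_handle,
			     const struct example_pd *pd)
{
	assert(pd != NULL);
	return example_export_to_fd(fd, new_handle, EXAMPLE_OBJ_PD, pd->handle);
}

static int example_export_mr(uint32_t fd, uint32_t *new_handle,
			     const struct example_mr *mr)
{
	assert(mr != NULL);
	return example_export_to_fd(fd, new_handle, EXAMPLE_OBJ_MR, mr->handle);
}
```

With this split the compiler rejects passing an MR where a PD is expected,
which a bare (type, handle) pair cannot guarantee.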

> 
> > > +struct ibv_pd *ibv_import_pd(struct ibv_context *context,
> > > +                             uint32_t fd,
> > > +                             uint32_t handle);
> > > +
> > > +struct ibv_mr *ibv_import_mr(struct ibv_context *context,
> > > +                             uint32_t fd,
> > > +                             uint32_t handle);
> > > +
> > > +uint32_t ibv_context_to_fd(struct ibv_context *context);
> > > +
> > > +uint32_t ibv_pd_to_handle(struct ibv_pd *pd);
> > > +
> > > +uint32_t ibv_mr_to_handle(struct ibv_mr *mr);
> > 
> > > Do you know if anything besides this new file needs to be done so I can
> > > run, e.g., man ibv_context_to_fd and get this man page?
> 
> Yes, they need to be setup in cmake with aliases.

Will take care of it, thanks.

> 
> I think this man page is kind of terse for such a complicated
> thing. 
> 
> Ie it doesn't talk about what happens when close() or ibv_destroy_X()
> is called.

We've mentioned that the returned object is like a regular object returned
from, e.g., ibv_alloc_pd and should be destroyed with the corresponding
destroy function.
We can add a note saying that the HW object is destroyed only when all
references to it have been destroyed.
Is that enough?

> 
> Jason
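
The lifetime rule Yuval describes, that the HW object goes away only with its
last reference, amounts to reference counting. A minimal single-threaded
sketch follows, assuming a plain counter (real code would need atomic or
locked updates). struct example_hw_obj and the example_obj_* helpers are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared hardware object with a plain reference count. */
struct example_hw_obj {
	unsigned int refs;
	bool destroyed;
};

/* The creator holds the first reference. */
static void example_obj_init(struct example_hw_obj *obj)
{
	obj->refs = 1;
	obj->destroyed = false;
}

/* Importing the object into another context takes one more reference. */
static void example_obj_import(struct example_hw_obj *obj)
{
	assert(!obj->destroyed);
	obj->refs++;
}

/*
 * Each destroy call drops one reference. The underlying object is
 * released only when the last reference is dropped.
 * Returns true if this call released the object.
 */
static bool example_obj_destroy(struct example_hw_obj *obj)
{
	assert(obj->refs > 0);
	if (--obj->refs == 0)
		obj->destroyed = true;
	return obj->destroyed;
}
```

An exporter and an importer can therefore call their destroy functions in
either order; only the second call releases the object.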


* [PATCH rdma-next 1/2] RDMA/mlx4: Separate creation of RWQ and QP
From: Leon Romanovsky @ 2019-07-04 13:09 UTC
  To: Doug Ledford, Jason Gunthorpe; +Cc: Leon Romanovsky, RDMA mailing list
In-Reply-To: <20190704130936.8705-1-leon@kernel.org>

From: Leon Romanovsky <leonro@mellanox.com>

The mlx4 WQ is implemented on top of a HW QP, without a dedicated HW
object. The current implementation reuses the common QP creation flow
for it. As a result, there is no mlx4_ib_wq struct, which is needed to
ensure proper allocation of ib_wq inside IB/core.

Split create_qp_common() into a pure QP flow and a new create_rq() for
the RWQ.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/hw/mlx4/qp.c | 236 +++++++++++++++++++++-----------
 1 file changed, 154 insertions(+), 82 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 82aff2f2fdc2..e409adac4e2e 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -855,12 +855,143 @@ static void mlx4_ib_release_wqn(struct mlx4_ib_ucontext *context,
 	mutex_unlock(&context->wqn_ranges_mutex);
 }
 
-static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
-			    enum mlx4_ib_source_type src,
-			    struct ib_qp_init_attr *init_attr,
+static int create_rq(struct ib_pd *pd, struct ib_qp_init_attr *init_attr,
+		     struct ib_udata *udata, struct mlx4_ib_qp *qp)
+{
+	struct mlx4_ib_dev *dev = to_mdev(pd->device);
+	int qpn;
+	int err;
+	struct mlx4_ib_ucontext *context = rdma_udata_to_drv_context(
+		udata, struct mlx4_ib_ucontext, ibucontext);
+	struct mlx4_ib_cq *mcq;
+	unsigned long flags;
+	int range_size;
+	struct mlx4_ib_create_wq wq;
+	size_t copy_len;
+	int shift;
+	int n;
+
+	qp->mlx4_ib_qp_type = MLX4_IB_QPT_RAW_PACKET;
+
+	mutex_init(&qp->mutex);
+	spin_lock_init(&qp->sq.lock);
+	spin_lock_init(&qp->rq.lock);
+	INIT_LIST_HEAD(&qp->gid_list);
+	INIT_LIST_HEAD(&qp->steering_rules);
+
+	qp->state = IB_QPS_RESET;
+
+	copy_len = min(sizeof(struct mlx4_ib_create_wq), udata->inlen);
+
+	if (ib_copy_from_udata(&wq, udata, copy_len)) {
+		err = -EFAULT;
+		goto err;
+	}
+
+	if (wq.comp_mask || wq.reserved[0] || wq.reserved[1] ||
+	    wq.reserved[2]) {
+		pr_debug("user command isn't supported\n");
+		err = -EOPNOTSUPP;
+		goto err;
+	}
+
+	if (wq.log_range_size > ilog2(dev->dev->caps.max_rss_tbl_sz)) {
+		pr_debug("WQN range size must be equal or smaller than %d\n",
+			 dev->dev->caps.max_rss_tbl_sz);
+		err = -EOPNOTSUPP;
+		goto err;
+	}
+	range_size = 1 << wq.log_range_size;
+
+	if (init_attr->create_flags & IB_QP_CREATE_SCATTER_FCS)
+		qp->flags |= MLX4_IB_QP_SCATTER_FCS;
+
+	err = set_rq_size(dev, &init_attr->cap, true, 1, qp, qp->inl_recv_sz);
+	if (err)
+		goto err;
+
+	qp->sq_no_prefetch = 1;
+	qp->sq.wqe_cnt = 1;
+	qp->sq.wqe_shift = MLX4_IB_MIN_SQ_STRIDE;
+	qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) +
+		       (qp->sq.wqe_cnt << qp->sq.wqe_shift);
+
+	qp->umem = ib_umem_get(udata, wq.buf_addr, qp->buf_size, 0, 0);
+	if (IS_ERR(qp->umem)) {
+		err = PTR_ERR(qp->umem);
+		goto err;
+	}
+
+	n = ib_umem_page_count(qp->umem);
+	shift = mlx4_ib_umem_calc_optimal_mtt_size(qp->umem, 0, &n);
+	err = mlx4_mtt_init(dev->dev, n, shift, &qp->mtt);
+
+	if (err)
+		goto err_buf;
+
+	err = mlx4_ib_umem_write_mtt(dev, &qp->mtt, qp->umem);
+	if (err)
+		goto err_mtt;
+
+	err = mlx4_ib_db_map_user(udata, wq.db_addr, &qp->db);
+	if (err)
+		goto err_mtt;
+	qp->mqp.usage = MLX4_RES_USAGE_USER_VERBS;
+
+	err = mlx4_ib_alloc_wqn(context, qp, range_size, &qpn);
+	if (err)
+		goto err_wrid;
+
+	err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp);
+	if (err)
+		goto err_qpn;
+
+	/*
+	 * Hardware wants QPN written in big-endian order (after
+	 * shifting) for send doorbell.  Precompute this value to save
+	 * a little bit when posting sends.
+	 */
+	qp->doorbell_qpn = swab32(qp->mqp.qpn << 8);
+
+	qp->mqp.event = mlx4_ib_wq_event;
+
+	spin_lock_irqsave(&dev->reset_flow_resource_lock, flags);
+	mlx4_ib_lock_cqs(to_mcq(init_attr->send_cq),
+			 to_mcq(init_attr->recv_cq));
+	/* Maintain device to QPs access, needed for further handling
+	 * via reset flow
+	 */
+	list_add_tail(&qp->qps_list, &dev->qp_list);
+	/* Maintain CQ to QPs access, needed for further handling
+	 * via reset flow
+	 */
+	mcq = to_mcq(init_attr->send_cq);
+	list_add_tail(&qp->cq_send_list, &mcq->send_qp_list);
+	mcq = to_mcq(init_attr->recv_cq);
+	list_add_tail(&qp->cq_recv_list, &mcq->recv_qp_list);
+	mlx4_ib_unlock_cqs(to_mcq(init_attr->send_cq),
+			   to_mcq(init_attr->recv_cq));
+	spin_unlock_irqrestore(&dev->reset_flow_resource_lock, flags);
+	return 0;
+
+err_qpn:
+	mlx4_ib_release_wqn(context, qp, 0);
+err_wrid:
+	mlx4_ib_db_unmap_user(context, &qp->db);
+
+err_mtt:
+	mlx4_mtt_cleanup(dev->dev, &qp->mtt);
+err_buf:
+	ib_umem_release(qp->umem);
+err:
+	return err;
+}
+
+static int create_qp_common(struct ib_pd *pd, struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn,
 			    struct mlx4_ib_qp **caller_qp)
 {
+	struct mlx4_ib_dev *dev = to_mdev(pd->device);
 	int qpn;
 	int err;
 	struct mlx4_ib_sqp *sqp = NULL;
@@ -870,7 +1001,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	enum mlx4_ib_qp_type qp_type = (enum mlx4_ib_qp_type) init_attr->qp_type;
 	struct mlx4_ib_cq *mcq;
 	unsigned long flags;
-	int range_size = 0;
 
 	/* When tunneling special qps, we use a plain UD qp */
 	if (sqpn) {
@@ -921,15 +1051,13 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			if (!sqp)
 				return -ENOMEM;
 			qp = &sqp->qp;
-			qp->pri.vid = 0xFFFF;
-			qp->alt.vid = 0xFFFF;
 		} else {
 			qp = kzalloc(sizeof(struct mlx4_ib_qp), GFP_KERNEL);
 			if (!qp)
 				return -ENOMEM;
-			qp->pri.vid = 0xFFFF;
-			qp->alt.vid = 0xFFFF;
 		}
+		qp->pri.vid = 0xFFFF;
+		qp->alt.vid = 0xFFFF;
 	} else
 		qp = *caller_qp;
 
@@ -941,48 +1069,24 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	INIT_LIST_HEAD(&qp->gid_list);
 	INIT_LIST_HEAD(&qp->steering_rules);
 
-	qp->state	 = IB_QPS_RESET;
+	qp->state = IB_QPS_RESET;
 	if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR)
 		qp->sq_signal_bits = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE);
 
-
 	if (udata) {
-		union {
-			struct mlx4_ib_create_qp qp;
-			struct mlx4_ib_create_wq wq;
-		} ucmd;
+		struct mlx4_ib_create_qp ucmd;
 		size_t copy_len;
 		int shift;
 		int n;
 
-		copy_len = (src == MLX4_IB_QP_SRC) ?
-			   sizeof(struct mlx4_ib_create_qp) :
-			   min(sizeof(struct mlx4_ib_create_wq), udata->inlen);
+		copy_len = sizeof(struct mlx4_ib_create_qp);
 
 		if (ib_copy_from_udata(&ucmd, udata, copy_len)) {
 			err = -EFAULT;
 			goto err;
 		}
 
-		if (src == MLX4_IB_RWQ_SRC) {
-			if (ucmd.wq.comp_mask || ucmd.wq.reserved[0] ||
-			    ucmd.wq.reserved[1] || ucmd.wq.reserved[2]) {
-				pr_debug("user command isn't supported\n");
-				err = -EOPNOTSUPP;
-				goto err;
-			}
-
-			if (ucmd.wq.log_range_size >
-			    ilog2(dev->dev->caps.max_rss_tbl_sz)) {
-				pr_debug("WQN range size must be equal or smaller than %d\n",
-					 dev->dev->caps.max_rss_tbl_sz);
-				err = -EOPNOTSUPP;
-				goto err;
-			}
-			range_size = 1 << ucmd.wq.log_range_size;
-		} else {
-			qp->inl_recv_sz = ucmd.qp.inl_recv_sz;
-		}
+		qp->inl_recv_sz = ucmd.inl_recv_sz;
 
 		if (init_attr->create_flags & IB_QP_CREATE_SCATTER_FCS) {
 			if (!(dev->dev->caps.flags &
@@ -1000,30 +1104,14 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err;
 
-		if (src == MLX4_IB_QP_SRC) {
-			qp->sq_no_prefetch = ucmd.qp.sq_no_prefetch;
+		qp->sq_no_prefetch = ucmd.sq_no_prefetch;
 
-			err = set_user_sq_size(dev, qp,
-					       (struct mlx4_ib_create_qp *)
-					       &ucmd);
-			if (err)
-				goto err;
-		} else {
-			qp->sq_no_prefetch = 1;
-			qp->sq.wqe_cnt = 1;
-			qp->sq.wqe_shift = MLX4_IB_MIN_SQ_STRIDE;
-			/* Allocated buffer expects to have at least that SQ
-			 * size.
-			 */
-			qp->buf_size = (qp->rq.wqe_cnt << qp->rq.wqe_shift) +
-				(qp->sq.wqe_cnt << qp->sq.wqe_shift);
-		}
+		err = set_user_sq_size(dev, qp, &ucmd);
+		if (err)
+			goto err;
 
 		qp->umem =
-			ib_umem_get(udata,
-				    (src == MLX4_IB_QP_SRC) ? ucmd.qp.buf_addr :
-							      ucmd.wq.buf_addr,
-				    qp->buf_size, 0, 0);
+			ib_umem_get(udata, ucmd.buf_addr, qp->buf_size, 0, 0);
 		if (IS_ERR(qp->umem)) {
 			err = PTR_ERR(qp->umem);
 			goto err;
@@ -1041,11 +1129,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			goto err_mtt;
 
 		if (qp_has_rq(init_attr)) {
-			err = mlx4_ib_db_map_user(udata,
-						  (src == MLX4_IB_QP_SRC) ?
-							  ucmd.qp.db_addr :
-							  ucmd.wq.db_addr,
-						  &qp->db);
+			err = mlx4_ib_db_map_user(udata, ucmd.db_addr, &qp->db);
 			if (err)
 				goto err_mtt;
 		}
@@ -1115,10 +1199,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 				goto err_wrid;
 			}
 		}
-	} else if (src == MLX4_IB_RWQ_SRC) {
-		err = mlx4_ib_alloc_wqn(context, qp, range_size, &qpn);
-		if (err)
-			goto err_wrid;
 	} else {
 		/* Raw packet QPNs may not have bits 6,7 set in their qp_num;
 		 * otherwise, the WQE BlueFlame setup flow wrongly causes
@@ -1157,8 +1237,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	 */
 	qp->doorbell_qpn = swab32(qp->mqp.qpn << 8);
 
-	qp->mqp.event = (src == MLX4_IB_QP_SRC) ? mlx4_ib_qp_event :
-						  mlx4_ib_wq_event;
+	qp->mqp.event = mlx4_ib_qp_event;
 
 	if (!*caller_qp)
 		*caller_qp = qp;
@@ -1186,8 +1265,6 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	if (!sqpn) {
 		if (qp->flags & MLX4_IB_QP_NETIF)
 			mlx4_ib_steer_qp_free(dev, qpn, 1);
-		else if (src == MLX4_IB_RWQ_SRC)
-			mlx4_ib_release_wqn(context, qp, 0);
 		else
 			mlx4_qp_release_range(dev->dev, qpn, 1);
 	}
@@ -1518,8 +1595,7 @@ static struct ib_qp *_mlx4_ib_create_qp(struct ib_pd *pd,
 		/* fall through */
 	case IB_QPT_UD:
 	{
-		err = create_qp_common(to_mdev(pd->device), pd,	MLX4_IB_QP_SRC,
-				       init_attr, udata, 0, &qp);
+		err = create_qp_common(pd, init_attr, udata, 0, &qp);
 		if (err) {
 			kfree(qp);
 			return ERR_PTR(err);
@@ -1549,8 +1625,7 @@ static struct ib_qp *_mlx4_ib_create_qp(struct ib_pd *pd,
 			sqpn = get_sqp_num(to_mdev(pd->device), init_attr);
 		}
 
-		err = create_qp_common(to_mdev(pd->device), pd, MLX4_IB_QP_SRC,
-				       init_attr, udata, sqpn, &qp);
+		err = create_qp_common(pd, init_attr, udata, sqpn, &qp);
 		if (err)
 			return ERR_PTR(err);
 
@@ -4047,8 +4122,8 @@ struct ib_wq *mlx4_ib_create_wq(struct ib_pd *pd,
 				struct ib_wq_init_attr *init_attr,
 				struct ib_udata *udata)
 {
-	struct mlx4_ib_dev *dev;
-	struct ib_qp_init_attr ib_qp_init_attr;
+	struct mlx4_dev *dev = to_mdev(pd->device)->dev;
+	struct ib_qp_init_attr ib_qp_init_attr = {};
 	struct mlx4_ib_qp *qp;
 	struct mlx4_ib_create_wq ucmd;
 	int err, required_cmd_sz;
@@ -4073,14 +4148,13 @@ struct ib_wq *mlx4_ib_create_wq(struct ib_pd *pd,
 	if (udata->outlen)
 		return ERR_PTR(-EOPNOTSUPP);
 
-	dev = to_mdev(pd->device);
-
 	if (init_attr->wq_type != IB_WQT_RQ) {
 		pr_debug("unsupported wq type %d\n", init_attr->wq_type);
 		return ERR_PTR(-EOPNOTSUPP);
 	}
 
-	if (init_attr->create_flags & ~IB_WQ_FLAGS_SCATTER_FCS) {
+	if (init_attr->create_flags & ~IB_WQ_FLAGS_SCATTER_FCS ||
+	    !(dev->caps.flags & MLX4_DEV_CAP_FLAG_FCS_KEEP)) {
 		pr_debug("unsupported create_flags %u\n",
 			 init_attr->create_flags);
 		return ERR_PTR(-EOPNOTSUPP);
@@ -4093,7 +4167,6 @@ struct ib_wq *mlx4_ib_create_wq(struct ib_pd *pd,
 	qp->pri.vid = 0xFFFF;
 	qp->alt.vid = 0xFFFF;
 
-	memset(&ib_qp_init_attr, 0, sizeof(ib_qp_init_attr));
 	ib_qp_init_attr.qp_context = init_attr->wq_context;
 	ib_qp_init_attr.qp_type = IB_QPT_RAW_PACKET;
 	ib_qp_init_attr.cap.max_recv_wr = init_attr->max_wr;
@@ -4104,8 +4177,7 @@ struct ib_wq *mlx4_ib_create_wq(struct ib_pd *pd,
 	if (init_attr->create_flags & IB_WQ_FLAGS_SCATTER_FCS)
 		ib_qp_init_attr.create_flags |= IB_QP_CREATE_SCATTER_FCS;
 
-	err = create_qp_common(dev, pd, MLX4_IB_RWQ_SRC, &ib_qp_init_attr,
-			       udata, 0, &qp);
+	err = create_rq(pd, &ib_qp_init_attr, udata, qp);
 	if (err) {
 		kfree(qp);
 		return ERR_PTR(err);
-- 
2.20.1
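
The new create_rq() in this patch unwinds through a ladder of labels (err_qpn,
err_wrid, err_mtt, err_buf, err), each undoing one earlier step. The pattern
can be sketched in userspace as follows. This is an illustration only:
struct example_rq, example_create_rq() and the fail_at test hook are
hypothetical, and malloc() stands in for the driver's resource acquisitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical object owning three resources acquired in order. */
struct example_rq {
	void *buf;
	void *mtt;
	void *db;
};

/*
 * Acquire the resources in order. On failure, jump to the label that
 * undoes only what was already acquired. fail_at (1..3) forces a failure
 * at that step so the unwind path can be exercised; 0 forces nothing.
 */
static int example_create_rq(struct example_rq *rq, int fail_at)
{
	int err = -ENOMEM;

	assert(rq != NULL);
	rq->buf = NULL;
	rq->mtt = NULL;
	rq->db = NULL;

	rq->buf = (fail_at == 1) ? NULL : malloc(64);
	if (!rq->buf)
		goto err;

	rq->mtt = (fail_at == 2) ? NULL : malloc(64);
	if (!rq->mtt)
		goto err_buf;

	rq->db = (fail_at == 3) ? NULL : malloc(64);
	if (!rq->db)
		goto err_mtt;

	return 0;

err_mtt:
	free(rq->mtt);
	rq->mtt = NULL;
err_buf:
	free(rq->buf);
	rq->buf = NULL;
err:
	return err;
}

/* Release everything in reverse order of acquisition. */
static void example_destroy_rq(struct example_rq *rq)
{
	free(rq->db);
	free(rq->mtt);
	free(rq->buf);
	rq->db = rq->mtt = rq->buf = NULL;
}
```

Because the labels fall through in reverse acquisition order, adding a new
step means adding one label rather than touching every error branch.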



* [PATCH rdma-next 2/2] RDMA/mlx4: Annotate boolean arguments as bool and not int
From: Leon Romanovsky @ 2019-07-04 13:09 UTC
  To: Doug Ledford, Jason Gunthorpe; +Cc: Leon Romanovsky, RDMA mailing list
In-Reply-To: <20190704130936.8705-1-leon@kernel.org>

From: Leon Romanovsky <leonro@mellanox.com>

The information returned by qp_has_rq() and used later is boolean,
so update the function and its callers to use the proper type.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/hw/mlx4/qp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index e409adac4e2e..bd4aa04416c6 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -325,7 +325,7 @@ static int send_wqe_overhead(enum mlx4_ib_qp_type type, u32 flags)
 }
 
 static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
-		       bool is_user, int has_rq, struct mlx4_ib_qp *qp,
+		       bool is_user, bool has_rq, struct mlx4_ib_qp *qp,
 		       u32 inl_recv_sz)
 {
 	/* Sanity check RQ size before proceeding */
@@ -506,10 +506,10 @@ static void free_proxy_bufs(struct ib_device *dev, struct mlx4_ib_qp *qp)
 	kfree(qp->sqp_proxy_rcv);
 }
 
-static int qp_has_rq(struct ib_qp_init_attr *attr)
+static bool qp_has_rq(struct ib_qp_init_attr *attr)
 {
 	if (attr->qp_type == IB_QPT_XRC_INI || attr->qp_type == IB_QPT_XRC_TGT)
-		return 0;
+		return false;
 
 	return !attr->srq;
 }
@@ -906,7 +906,7 @@ static int create_rq(struct ib_pd *pd, struct ib_qp_init_attr *init_attr,
 	if (init_attr->create_flags & IB_QP_CREATE_SCATTER_FCS)
 		qp->flags |= MLX4_IB_QP_SCATTER_FCS;
 
-	err = set_rq_size(dev, &init_attr->cap, true, 1, qp, qp->inl_recv_sz);
+	err = set_rq_size(dev, &init_attr->cap, true, true, qp, qp->inl_recv_sz);
 	if (err)
 		goto err;
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH rdma-next 0/2] Initial code to cleanup RWQ
From: Leon Romanovsky @ 2019-07-04 13:09 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe; +Cc: Leon Romanovsky, RDMA mailing list

From: Leon Romanovsky <leonro@mellanox.com>

In mlx4, the RWQ is integrated into the QP, which complicates my
deallocation patches. This is a small set of patches that I carried for
the whole cycle.

Thanks

Leon Romanovsky (2):
  RDMA/mlx4: Separate creation of RWQ and QP
  RDMA/mlx4: Annotate boolean arguments as bool and not int

 drivers/infiniband/hw/mlx4/qp.c | 242 +++++++++++++++++++++-----------
 1 file changed, 157 insertions(+), 85 deletions(-)

--
2.20.1


^ permalink raw reply

* [PATCH rdma-next 1/2] IB/core: Work on the caller socket net namespace in nldev_newlink()
From: Leon Romanovsky @ 2019-07-04 13:04 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Parav Pandit, Steve Wise
In-Reply-To: <20190704130402.8431-1-leon@kernel.org>

From: Parav Pandit <parav@mellanox.com>

While creating new RDMA devices based on a netdevice name, consider the
net namespace of the caller skb's socket, as is done in the rest of the
doit() callbacks and in nldev_dellink(), which deletes the RDMA device
created using nldev_newlink().

Fixes: 3856ec4b93c94 ("RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/nldev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index d9f2a30e6467..783e465e7c41 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -1476,7 +1476,7 @@ static int nldev_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	nla_strlcpy(ndev_name, tb[RDMA_NLDEV_ATTR_NDEV_NAME],
 		    sizeof(ndev_name));

-	ndev = dev_get_by_name(&init_net, ndev_name);
+	ndev = dev_get_by_name(sock_net(skb->sk), ndev_name);
 	if (!ndev)
 		return -ENODEV;

--
2.20.1


^ permalink raw reply related
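
The effect of the one-line change above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: struct
model_net, struct model_netdev and model_dev_get_by_name() are hypothetical
stand-ins and not kernel APIs. The real dev_get_by_name() takes a struct net
pointer and a name, and the patch passes sock_net(skb->sk) instead of
&init_net.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct net_device and struct net. */
struct model_netdev {
	const char *name;
};

struct model_net {
	const struct model_netdev *devs;
	size_t ndevs;
};

/* Look up a device by name in exactly one namespace, never in another. */
static const struct model_netdev *
model_dev_get_by_name(const struct model_net *net, const char *name)
{
	size_t i;

	for (i = 0; i < net->ndevs; i++)
		if (strcmp(net->devs[i].name, name) == 0)
			return &net->devs[i];
	return NULL;
}

/* Two illustrative namespaces, each owning a different netdevice. */
static const struct model_netdev model_init_devs[] = { { "eth0" } };
static const struct model_netdev model_ns1_devs[] = { { "veth1" } };

static const struct model_net model_init_net = { model_init_devs, 1 };
static const struct model_net model_ns1 = { model_ns1_devs, 1 };
```

In this model a caller in model_ns1 that asks for "veth1" only finds it when
the lookup uses the caller's namespace; a lookup hard-wired to
model_init_net returns NULL, which corresponds to the -ENODEV path in
nldev_newlink().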

* [PATCH rdma-next 2/2] IB: Support netlink commands in non init_net net namespaces
From: Leon Romanovsky @ 2019-07-04 13:04 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Parav Pandit, Steve Wise
In-Reply-To: <20190704130402.8431-1-leon@kernel.org>

From: Parav Pandit <parav@mellanox.com>

Now that the IB core supports binding an RDMA device to a specific net
namespace, enable the IB core to accept netlink commands in non-init_net
namespaces.

This is done by having a per-net-namespace netlink socket.

At present, only the netlink device handling client, RDMA_NL_NLDEV,
supports device handling in multiple net namespaces.
Hence, do not accept netlink messages for other clients in non-init_net
net namespaces.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/addr.c      |  2 +-
 drivers/infiniband/core/core_priv.h | 19 +++++++++--
 drivers/infiniband/core/device.c    | 34 ++++++------------
 drivers/infiniband/core/iwpm_msg.c  |  8 ++---
 drivers/infiniband/core/iwpm_util.c |  6 ++--
 drivers/infiniband/core/netlink.c   | 53 ++++++++++++++++++-----------
 drivers/infiniband/core/nldev.c     | 20 +++++------
 drivers/infiniband/core/sa_query.c  |  2 +-
 include/rdma/rdma_netlink.h         | 10 ++++--
 9 files changed, 86 insertions(+), 68 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 2f7d14159841..9be0400a9a51 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -183,7 +183,7 @@ static int ib_nl_ip_send_msg(struct rdma_dev_addr *dev_addr,

 	/* Repair the nlmsg header length */
 	nlmsg_end(skb, nlh);
-	rdma_nl_multicast(skb, RDMA_NL_GROUP_LS, GFP_KERNEL);
+	rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, GFP_KERNEL);

 	/* Make the request retry, so when we get the response from userspace
 	 * we will have something.
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 888d89ce81df..b441a31fd731 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -54,8 +54,21 @@ struct pkey_index_qp_list {
 	struct list_head    qp_list;
 };

+/**
+ * struct rdma_dev_net - rdma net namespace metadata for a net
+ * @nl_sock:	Pointer to netlink socket
+ * @net:	Pointer to owner net namespace
+ * @id:		xarray id to identify the net namespace.
+ */
+struct rdma_dev_net {
+	struct sock *nl_sock;
+	possible_net_t net;
+	u32 id;
+};
+
 extern const struct attribute_group ib_dev_attr_group;
 extern bool ib_devices_shared_netns;
+extern unsigned int rdma_dev_net_id;

 int ib_device_register_sysfs(struct ib_device *device);
 void ib_device_unregister_sysfs(struct ib_device *device);
@@ -179,9 +192,6 @@ void ib_mad_cleanup(void);
 int ib_sa_init(void);
 void ib_sa_cleanup(void);

-int rdma_nl_init(void);
-void rdma_nl_exit(void);
-
 int ib_nl_handle_resolve_resp(struct sk_buff *skb,
 			      struct nlmsghdr *nlh,
 			      struct netlink_ext_ack *extack);
@@ -362,4 +372,7 @@ void ib_port_unregister_module_stat(struct kobject *kobj);

 int ib_device_set_netns_put(struct sk_buff *skb,
 			    struct ib_device *dev, u32 ns_fd);
+
+int rdma_nl_net_init(struct rdma_dev_net *rnet);
+void rdma_nl_net_exit(struct rdma_dev_net *rnet);
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index adf8d93bb42d..b404a8dfea89 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -105,17 +105,7 @@ static DECLARE_RWSEM(clients_rwsem);
  */
 #define CLIENT_DATA_REGISTERED XA_MARK_1

-/**
- * struct rdma_dev_net - rdma net namespace metadata for a net
- * @net:	Pointer to owner net namespace
- * @id:		xarray id to identify the net namespace.
- */
-struct rdma_dev_net {
-	possible_net_t net;
-	u32 id;
-};
-
-static unsigned int rdma_dev_net_id;
+unsigned int rdma_dev_net_id;

 /*
  * A list of net namespaces is maintained in an xarray. This is necessary
@@ -1084,6 +1074,7 @@ static void rdma_dev_exit_net(struct net *net)
 	}
 	up_read(&devices_rwsem);

+	rdma_nl_net_exit(rnet);
 	xa_erase(&rdma_nets, rnet->id);
 }

@@ -1094,15 +1085,21 @@ static __net_init int rdma_dev_init_net(struct net *net)
 	struct ib_device *dev;
 	int ret;

+	write_pnet(&rnet->net, net);
+
+	ret = rdma_nl_net_init(rnet);
+	if (ret)
+		return ret;
+
 	/* No need to create any compat devices in default init_net. */
 	if (net_eq(net, &init_net))
 		return 0;

-	write_pnet(&rnet->net, net);
-
 	ret = xa_alloc(&rdma_nets, &rnet->id, rnet, xa_limit_32b, GFP_KERNEL);
-	if (ret)
+	if (ret) {
+		rdma_nl_net_exit(rnet);
 		return ret;
+	}

 	down_read(&devices_rwsem);
 	xa_for_each_marked (&devices, index, dev, DEVICE_REGISTERED) {
@@ -2629,12 +2626,6 @@ static int __init ib_core_init(void)
 		goto err_comp_unbound;
 	}

-	ret = rdma_nl_init();
-	if (ret) {
-		pr_warn("Couldn't init IB netlink interface: err %d\n", ret);
-		goto err_sysfs;
-	}
-
 	ret = addr_init();
 	if (ret) {
 		pr_warn("Could't init IB address resolution\n");
@@ -2680,8 +2671,6 @@ static int __init ib_core_init(void)
 err_addr:
 	addr_cleanup();
 err_ibnl:
-	rdma_nl_exit();
-err_sysfs:
 	class_unregister(&ib_class);
 err_comp_unbound:
 	destroy_workqueue(ib_comp_unbound_wq);
@@ -2702,7 +2691,6 @@ static void __exit ib_core_cleanup(void)
 	ib_sa_cleanup();
 	ib_mad_cleanup();
 	addr_cleanup();
-	rdma_nl_exit();
 	class_unregister(&ib_class);
 	destroy_workqueue(ib_comp_unbound_wq);
 	destroy_workqueue(ib_comp_wq);
diff --git a/drivers/infiniband/core/iwpm_msg.c b/drivers/infiniband/core/iwpm_msg.c
index 2452b0ddcf0d..f1a873d4e842 100644
--- a/drivers/infiniband/core/iwpm_msg.c
+++ b/drivers/infiniband/core/iwpm_msg.c
@@ -112,7 +112,7 @@ int iwpm_register_pid(struct iwpm_dev_data *pm_msg, u8 nl_client)
 	pr_debug("%s: Multicasting a nlmsg (dev = %s ifname = %s iwpm = %s)\n",
 		__func__, pm_msg->dev_name, pm_msg->if_name, iwpm_ulib_name);

-	ret = rdma_nl_multicast(skb, RDMA_NL_GROUP_IWPM, GFP_KERNEL);
+	ret = rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_IWPM, GFP_KERNEL);
 	if (ret) {
 		skb = NULL; /* skb is freed in the netlink send-op handling */
 		iwpm_user_pid = IWPM_PID_UNAVAILABLE;
@@ -202,7 +202,7 @@ int iwpm_add_mapping(struct iwpm_sa_data *pm_msg, u8 nl_client)
 	nlmsg_end(skb, nlh);
 	nlmsg_request->req_buffer = pm_msg;

-	ret = rdma_nl_unicast_wait(skb, iwpm_user_pid);
+	ret = rdma_nl_unicast_wait(&init_net, skb, iwpm_user_pid);
 	if (ret) {
 		skb = NULL; /* skb is freed in the netlink send-op handling */
 		iwpm_user_pid = IWPM_PID_UNDEFINED;
@@ -297,7 +297,7 @@ int iwpm_add_and_query_mapping(struct iwpm_sa_data *pm_msg, u8 nl_client)
 	nlmsg_end(skb, nlh);
 	nlmsg_request->req_buffer = pm_msg;

-	ret = rdma_nl_unicast_wait(skb, iwpm_user_pid);
+	ret = rdma_nl_unicast_wait(&init_net, skb, iwpm_user_pid);
 	if (ret) {
 		skb = NULL; /* skb is freed in the netlink send-op handling */
 		err_str = "Unable to send a nlmsg";
@@ -364,7 +364,7 @@ int iwpm_remove_mapping(struct sockaddr_storage *local_addr, u8 nl_client)

 	nlmsg_end(skb, nlh);

-	ret = rdma_nl_unicast_wait(skb, iwpm_user_pid);
+	ret = rdma_nl_unicast_wait(&init_net, skb, iwpm_user_pid);
 	if (ret) {
 		skb = NULL; /* skb is freed in the netlink send-op handling */
 		iwpm_user_pid = IWPM_PID_UNDEFINED;
diff --git a/drivers/infiniband/core/iwpm_util.c b/drivers/infiniband/core/iwpm_util.c
index 41929bb83739..c7ad3499228c 100644
--- a/drivers/infiniband/core/iwpm_util.c
+++ b/drivers/infiniband/core/iwpm_util.c
@@ -645,7 +645,7 @@ static int send_mapinfo_num(u32 mapping_num, u8 nl_client, int iwpm_pid)

 	nlmsg_end(skb, nlh);

-	ret = rdma_nl_unicast(skb, iwpm_pid);
+	ret = rdma_nl_unicast(&init_net, skb, iwpm_pid);
 	if (ret) {
 		skb = NULL;
 		err_str = "Unable to send a nlmsg";
@@ -674,7 +674,7 @@ static int send_nlmsg_done(struct sk_buff *skb, u8 nl_client, int iwpm_pid)
 		return -ENOMEM;
 	}
 	nlh->nlmsg_type = NLMSG_DONE;
-	ret = rdma_nl_unicast(skb, iwpm_pid);
+	ret = rdma_nl_unicast(&init_net, skb, iwpm_pid);
 	if (ret)
 		pr_warn("%s Unable to send a nlmsg\n", __func__);
 	return ret;
@@ -824,7 +824,7 @@ int iwpm_send_hello(u8 nl_client, int iwpm_pid, u16 abi_version)
 		goto hello_num_error;
 	nlmsg_end(skb, nlh);

-	ret = rdma_nl_unicast(skb, iwpm_pid);
+	ret = rdma_nl_unicast(&init_net, skb, iwpm_pid);
 	if (ret) {
 		skb = NULL;
 		err_str = "Unable to send a nlmsg";
diff --git a/drivers/infiniband/core/netlink.c b/drivers/infiniband/core/netlink.c
index eecfc0b377c9..676db08e7b4e 100644
--- a/drivers/infiniband/core/netlink.c
+++ b/drivers/infiniband/core/netlink.c
@@ -36,20 +36,22 @@
 #include <linux/export.h>
 #include <net/netlink.h>
 #include <net/net_namespace.h>
+#include <net/netns/generic.h>
 #include <net/sock.h>
 #include <rdma/rdma_netlink.h>
 #include <linux/module.h>
 #include "core_priv.h"

 static DEFINE_MUTEX(rdma_nl_mutex);
-static struct sock *nls;
 static struct {
 	const struct rdma_nl_cbs   *cb_table;
 } rdma_nl_types[RDMA_NL_NUM_CLIENTS];

 bool rdma_nl_chk_listeners(unsigned int group)
 {
-	return netlink_has_listeners(nls, group);
+	struct rdma_dev_net *rnet = net_generic(&init_net, rdma_dev_net_id);
+
+	return netlink_has_listeners(rnet->nl_sock, group);
 }
 EXPORT_SYMBOL(rdma_nl_chk_listeners);

@@ -73,13 +75,21 @@ static bool is_nl_msg_valid(unsigned int type, unsigned int op)
 	return (op < max_num_ops[type]) ? true : false;
 }

-static bool is_nl_valid(unsigned int type, unsigned int op)
+static bool
+is_nl_valid(const struct sk_buff *skb, unsigned int type, unsigned int op)
 {
 	const struct rdma_nl_cbs *cb_table;

 	if (!is_nl_msg_valid(type, op))
 		return false;

+	/*
+	 * Currently only NLDEV client is supporting netlink commands in
+	 * non init_net net namespace.
+	 */
+	if (sock_net(skb->sk) != &init_net && type != RDMA_NL_NLDEV)
+		return false;
+
 	if (!rdma_nl_types[type].cb_table) {
 		mutex_unlock(&rdma_nl_mutex);
 		request_module("rdma-netlink-subsys-%d", type);
@@ -161,7 +171,7 @@ static int rdma_nl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 	unsigned int op = RDMA_NL_GET_OP(type);
 	const struct rdma_nl_cbs *cb_table;

-	if (!is_nl_valid(index, op))
+	if (!is_nl_valid(skb, index, op))
 		return -EINVAL;

 	cb_table = rdma_nl_types[index].cb_table;
@@ -185,7 +195,7 @@ static int rdma_nl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 			.dump = cb_table[op].dump,
 		};
 		if (c.dump)
-			return netlink_dump_start(nls, skb, nlh, &c);
+			return netlink_dump_start(skb->sk, skb, nlh, &c);
 		return -EINVAL;
 	}

@@ -258,52 +268,55 @@ static void rdma_nl_rcv(struct sk_buff *skb)
 	mutex_unlock(&rdma_nl_mutex);
 }

-int rdma_nl_unicast(struct sk_buff *skb, u32 pid)
+int rdma_nl_unicast(struct net *net, struct sk_buff *skb, u32 pid)
 {
+	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
 	int err;

-	err = netlink_unicast(nls, skb, pid, MSG_DONTWAIT);
+	err = netlink_unicast(rnet->nl_sock, skb, pid, MSG_DONTWAIT);
 	return (err < 0) ? err : 0;
 }
 EXPORT_SYMBOL(rdma_nl_unicast);

-int rdma_nl_unicast_wait(struct sk_buff *skb, __u32 pid)
+int rdma_nl_unicast_wait(struct net *net, struct sk_buff *skb, __u32 pid)
 {
+	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
 	int err;

-	err = netlink_unicast(nls, skb, pid, 0);
+	err = netlink_unicast(rnet->nl_sock, skb, pid, 0);
 	return (err < 0) ? err : 0;
 }
 EXPORT_SYMBOL(rdma_nl_unicast_wait);

-int rdma_nl_multicast(struct sk_buff *skb, unsigned int group, gfp_t flags)
+int rdma_nl_multicast(struct net *net, struct sk_buff *skb,
+		      unsigned int group, gfp_t flags)
 {
-	return nlmsg_multicast(nls, skb, 0, group, flags);
+	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
+
+	return nlmsg_multicast(rnet->nl_sock, skb, 0, group, flags);
 }
 EXPORT_SYMBOL(rdma_nl_multicast);

-int __init rdma_nl_init(void)
+int rdma_nl_net_init(struct rdma_dev_net *rnet)
 {
+	struct net *net = read_pnet(&rnet->net);
 	struct netlink_kernel_cfg cfg = {
 		.input	= rdma_nl_rcv,
 	};
+	struct sock *nls;

-	nls = netlink_kernel_create(&init_net, NETLINK_RDMA, &cfg);
+	nls = netlink_kernel_create(net, NETLINK_RDMA, &cfg);
 	if (!nls)
 		return -ENOMEM;

 	nls->sk_sndtimeo = 10 * HZ;
+	rnet->nl_sock = nls;
 	return 0;
 }

-void rdma_nl_exit(void)
+void rdma_nl_net_exit(struct rdma_dev_net *rnet)
 {
-	int idx;
-
-	for (idx = 0; idx < RDMA_NL_NUM_CLIENTS; idx++)
-		rdma_nl_unregister(idx);
-
-	netlink_kernel_release(nls);
+	netlink_kernel_release(rnet->nl_sock);
 }

 MODULE_ALIAS_NET_PF_PROTO(PF_NETLINK, NETLINK_RDMA);
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 783e465e7c41..e287b71a1cfd 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -832,7 +832,7 @@ static int nldev_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 	nlmsg_end(msg, nlh);

 	ib_device_put(device);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_free:
 	nlmsg_free(msg);
@@ -972,7 +972,7 @@ static int nldev_port_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 	nlmsg_end(msg, nlh);
 	ib_device_put(device);

-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_free:
 	nlmsg_free(msg);
@@ -1074,7 +1074,7 @@ static int nldev_res_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,

 	nlmsg_end(msg, nlh);
 	ib_device_put(device);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_free:
 	nlmsg_free(msg);
@@ -1251,7 +1251,7 @@ static int res_get_common_doit(struct sk_buff *skb, struct nlmsghdr *nlh,

 	nlmsg_end(msg, nlh);
 	ib_device_put(device);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_free:
 	nlmsg_free(msg);
@@ -1596,7 +1596,7 @@ static int nldev_get_chardev(struct sk_buff *skb, struct nlmsghdr *nlh,
 	put_device(data.cdev);
 	if (ibdev)
 		ib_device_put(ibdev);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 out_data:
 	put_device(data.cdev);
@@ -1636,7 +1636,7 @@ static int nldev_sys_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return err;
 	}
 	nlmsg_end(msg, nlh);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);
 }

 static int nldev_set_sys_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -1734,7 +1734,7 @@ static int nldev_stat_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,

 	nlmsg_end(msg, nlh);
 	ib_device_put(device);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_fill:
 	rdma_counter_unbind_qpn(device, port, qpn, cntn);
@@ -1802,7 +1802,7 @@ static int nldev_stat_del_doit(struct sk_buff *skb, struct nlmsghdr *nlh,

 	nlmsg_end(msg, nlh);
 	ib_device_put(device);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_fill:
 	rdma_counter_bind_qpn(device, port, qpn, cntn);
@@ -1893,7 +1893,7 @@ static int stat_get_doit_default_counter(struct sk_buff *skb,
 	mutex_unlock(&stats->lock);
 	nlmsg_end(msg, nlh);
 	ib_device_put(device);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_table:
 	nla_nest_cancel(msg, table_attr);
@@ -1961,7 +1961,7 @@ static int stat_get_doit_qp(struct sk_buff *skb, struct nlmsghdr *nlh,

 	nlmsg_end(msg, nlh);
 	ib_device_put(device);
-	return rdma_nl_unicast(msg, NETLINK_CB(skb).portid);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);

 err_msg:
 	nlmsg_free(msg);
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 7d8071c7e564..17fc2936c077 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -860,7 +860,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query, gfp_t gfp_mask)
 	/* Repair the nlmsg header length */
 	nlmsg_end(skb, nlh);

-	return rdma_nl_multicast(skb, RDMA_NL_GROUP_LS, gfp_mask);
+	return rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, gfp_mask);
 }

 static int ib_nl_make_request(struct ib_sa_query *query, gfp_t gfp_mask)
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index 6631624e4d7c..ab22759de7ea 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -76,28 +76,32 @@ int ibnl_put_attr(struct sk_buff *skb, struct nlmsghdr *nlh,

 /**
  * Send the supplied skb to a specific userspace PID.
+ * @net: Net namespace in which to send the skb
  * @skb: The netlink skb
  * @pid: Userspace netlink process ID
  * Returns 0 on success or a negative error code.
  */
-int rdma_nl_unicast(struct sk_buff *skb, u32 pid);
+int rdma_nl_unicast(struct net *net, struct sk_buff *skb, u32 pid);

 /**
  * Send, with wait/1 retry, the supplied skb to a specific userspace PID.
+ * @net: Net namespace in which to send the skb
  * @skb: The netlink skb
  * @pid: Userspace netlink process ID
  * Returns 0 on success or a negative error code.
  */
-int rdma_nl_unicast_wait(struct sk_buff *skb, __u32 pid);
+int rdma_nl_unicast_wait(struct net *net, struct sk_buff *skb, __u32 pid);

 /**
  * Send the supplied skb to a netlink group.
+ * @net: Net namespace in which to send the skb
  * @skb: The netlink skb
  * @group: Netlink group ID
  * @flags: allocation flags
  * Returns 0 on success or a negative error code.
  */
-int rdma_nl_multicast(struct sk_buff *skb, unsigned int group, gfp_t flags);
+int rdma_nl_multicast(struct net *net, struct sk_buff *skb,
+		      unsigned int group, gfp_t flags);

 /**
  * Check if there are any listeners to the netlink group
--
2.20.1


^ permalink raw reply related
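
The two ideas in the patch above, one netlink socket per net namespace and
rejecting every client except NLDEV outside init_net, can be sketched in
userspace C. All model_* names are hypothetical; in particular
model_net_generic() only imitates the role that net_generic(net,
rdma_dev_net_id) plays in the patch and is not the kernel function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical client ids; only the NLDEV name mirrors the patch. */
enum model_client { MODEL_NL_LS, MODEL_NL_IWPM, MODEL_NL_NLDEV };

struct model_sock {
	int id;
};

/* Per-namespace RDMA state, loosely modeled on struct rdma_dev_net. */
struct model_rdma_net {
	struct model_sock *nl_sock;
};

struct model_net {
	bool is_init_net;
	struct model_rdma_net rnet;	/* stands in for net_generic() storage */
};

/* Return the per-namespace RDMA state of @net. */
static struct model_rdma_net *model_net_generic(struct model_net *net)
{
	return &net->rnet;
}

/* Only the NLDEV client is accepted outside the initial namespace. */
static bool model_is_nl_valid(const struct model_net *net,
			      enum model_client type)
{
	if (!net->is_init_net && type != MODEL_NL_NLDEV)
		return false;
	return true;
}

/* Replies go out on the socket of the namespace the request came from. */
static struct model_sock *model_reply_sock(struct model_net *net)
{
	return model_net_generic(net)->nl_sock;
}

static struct model_sock model_init_sock = { 0 };
static struct model_sock model_ns1_sock = { 1 };
static struct model_net model_init_net = { true, { &model_init_sock } };
static struct model_net model_ns1 = { false, { &model_ns1_sock } };
```

Keeping the socket in per-namespace storage rather than in a single static
variable is what lets the send helpers take a namespace argument; callers
that are not namespace-aware yet simply keep passing the initial namespace.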

* [PATCH rdma-next 0/2] Allow netlink commands in non init_net net namespace
From: Leon Romanovsky @ 2019-07-04 13:04 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Parav Pandit, Steve Wise

From: Leon Romanovsky <leonro@mellanox.com>

Now that RDMA devices can be attached to a specific net namespace,
allow netlink commands in non-init_net namespaces.

Parav Pandit (2):
  IB/core: Work on the caller socket net namespace in nldev_newlink()
  IB: Support netlink commands in non init_net net namespaces

 drivers/infiniband/core/addr.c      |  2 +-
 drivers/infiniband/core/core_priv.h | 19 +++++++++--
 drivers/infiniband/core/device.c    | 34 ++++++------------
 drivers/infiniband/core/iwpm_msg.c  |  8 ++---
 drivers/infiniband/core/iwpm_util.c |  6 ++--
 drivers/infiniband/core/netlink.c   | 53 ++++++++++++++++++-----------
 drivers/infiniband/core/nldev.c     | 22 ++++++------
 drivers/infiniband/core/sa_query.c  |  2 +-
 include/rdma/rdma_netlink.h         | 10 ++++--
 9 files changed, 87 insertions(+), 69 deletions(-)

--
2.20.1


^ permalink raw reply

* [PATCH rdma-next] RDMA/core: Annotate destroy of mutex to ensure that it is released as unlocked
From: Leon Romanovsky @ 2019-07-04 13:00 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Parav Pandit, RDMA mailing list, Leon Romanovsky

From: Parav Pandit <parav@mellanox.com>

When compiled with CONFIG_DEBUG_MUTEXES, the kernel checks that a mutex
is not held when it is destroyed.
Hence, add mutex_destroy() for the mutexes used in RDMA modules.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/cache.c        | 1 +
 drivers/infiniband/core/cma_configfs.c | 1 +
 drivers/infiniband/core/device.c       | 3 +++
 drivers/infiniband/core/user_mad.c     | 2 +-
 drivers/infiniband/core/uverbs_main.c  | 2 ++
 drivers/infiniband/core/verbs.c        | 1 +
 6 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 18e476b3ced0..00fb3eacda19 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -810,6 +810,7 @@ static void release_gid_table(struct ib_device *device,
 	if (leak)
 		return;
 
+	mutex_destroy(&table->lock);
 	kfree(table->data_vec);
 	kfree(table);
 }
diff --git a/drivers/infiniband/core/cma_configfs.c b/drivers/infiniband/core/cma_configfs.c
index 3ec2c415bb70..0a7b5eba2fc0 100644
--- a/drivers/infiniband/core/cma_configfs.c
+++ b/drivers/infiniband/core/cma_configfs.c
@@ -350,4 +350,5 @@ int __init cma_configfs_init(void)
 void __exit cma_configfs_exit(void)
 {
 	configfs_unregister_subsystem(&cma_subsys);
+	mutex_destroy(&cma_subsys.su_mutex);
 }
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 7f4affe8a10d..adf8d93bb42d 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -508,6 +508,9 @@ static void ib_device_release(struct device *device)
 			  rcu_head);
 	}
 
+	mutex_destroy(&dev->unregistration_lock);
+	mutex_destroy(&dev->compat_devs_mutex);
+
 	xa_destroy(&dev->compat_devs);
 	xa_destroy(&dev->client_data);
 	kfree_rcu(dev, rcu_head);
diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 9f8a48016b41..e0512aef033c 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1038,7 +1038,7 @@ static int ib_umad_close(struct inode *inode, struct file *filp)
 				ib_unregister_mad_agent(file->agent[i]);
 
 	mutex_unlock(&file->port->file_mutex);
-
+	mutex_destroy(&file->mutex);
 	kfree(file);
 	return 0;
 }
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 11c13c1381cf..4827aa3415ff 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -120,6 +120,8 @@ static void ib_uverbs_release_dev(struct device *device)
 
 	uverbs_destroy_api(dev->uapi);
 	cleanup_srcu_struct(&dev->disassociate_srcu);
+	mutex_destroy(&dev->lists_mutex);
+	mutex_destroy(&dev->xrcd_tree_mutex);
 	kfree(dev);
 }
 
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 92349bf37589..f974b6854224 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2259,6 +2259,7 @@ int ib_dealloc_xrcd(struct ib_xrcd *xrcd, struct ib_udata *udata)
 		if (ret)
 			return ret;
 	}
+	mutex_destroy(&xrcd->tgt_qp_mutex);
 
 	return xrcd->device->ops.dealloc_xrcd(xrcd, udata);
 }
-- 
2.20.1


^ permalink raw reply related
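
The pattern applied throughout the patch above, destroying a lock right
before the memory that contains it is freed, has a close userspace analogue
in POSIX threads. The following is a minimal sketch of that analogue, not
kernel code: struct model_table and the model_table_*() helpers are
hypothetical names, and pthread_mutex_destroy() stands in for
mutex_destroy().

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical object that embeds its own lock, like the GID table. */
struct model_table {
	pthread_mutex_t lock;
	int *data_vec;
};

static struct model_table *model_table_alloc(size_t n)
{
	struct model_table *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->data_vec = calloc(n, sizeof(*t->data_vec));
	if (!t->data_vec)
		goto err_table;
	if (pthread_mutex_init(&t->lock, NULL))
		goto err_vec;
	return t;

err_vec:
	free(t->data_vec);
err_table:
	free(t);
	return NULL;
}

/*
 * Destroy the lock first, then free the memory that holds it.
 * Returns 0 on success; on error the table is left untouched and is
 * still owned by the caller.
 */
static int model_table_release(struct model_table *t)
{
	int ret = pthread_mutex_destroy(&t->lock);

	if (ret)
		return ret;
	free(t->data_vec);
	free(t);
	return 0;
}

/* Allocate, use the lock once, then release with the lock unlocked. */
static int model_table_roundtrip(void)
{
	struct model_table *t = model_table_alloc(4);

	if (!t)
		return -1;
	pthread_mutex_lock(&t->lock);
	t->data_vec[0] = 1;
	pthread_mutex_unlock(&t->lock);
	return model_table_release(t);
}
```

The ordering matters in both worlds: the destroy call has to run while the
object is still valid memory, so it goes before the corresponding free.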

* [PATCH rdma-next v4 1/3] linux/dim: Implement RDMA adaptive moderation (DIM)
From: Leon Romanovsky @ 2019-07-04 12:57 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Max Gurtovoy, Saeed Mahameed,
	Sagi Grimberg, Yamin Friedman
In-Reply-To: <20190704125743.7814-1-leon@kernel.org>

From: Yamin Friedman <yaminf@mellanox.com>

RDMA DIM implements a different algorithm from net DIM and is based on
completions, which is how we can implement interrupt moderation in RDMA.

The algorithm optimizes for number of completions and ratio between
completions and events. In order to avoid long latencies, the
implementation performs fast reduction of moderation level when the
traffic changes.

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 include/linux/dim.h |  36 +++++++++++++++
 lib/dim/Makefile    |   6 +--
 lib/dim/rdma_dim.c  | 108 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 4 deletions(-)
 create mode 100644 lib/dim/rdma_dim.c

diff --git a/include/linux/dim.h b/include/linux/dim.h
index aa9bdd47a648..aa69730c3b8d 100644
--- a/include/linux/dim.h
+++ b/include/linux/dim.h
@@ -82,6 +82,7 @@ struct dim_stats {
  * @prev_stats: Measured rates from previous iteration (for comparison)
  * @start_sample: Sampled data at start of current iteration
  * @work: Work to perform on action required
+ * @priv: A pointer to the struct that points to dim
  * @profile_ix: Current moderation profile
  * @mode: CQ period count mode
  * @tune_state: Algorithm tuning state (see below)
@@ -95,6 +96,7 @@ struct dim {
 	struct dim_sample start_sample;
 	struct dim_sample measuring_sample;
 	struct work_struct work;
+	void *priv;
 	u8 profile_ix;
 	u8 mode;
 	u8 tune_state;
@@ -363,4 +365,38 @@ struct dim_cq_moder net_dim_get_def_tx_moderation(u8 cq_period_mode);
  */
 void net_dim(struct dim *dim, struct dim_sample end_sample);
 
+/* RDMA DIM */
+
+/*
+ * RDMA DIM profile:
+ * profile size must be of RDMA_DIM_PARAMS_NUM_PROFILES.
+ */
+#define RDMA_DIM_PARAMS_NUM_PROFILES 9
+#define RDMA_DIM_START_PROFILE 0
+
+static const struct dim_cq_moder
+rdma_dim_prof[RDMA_DIM_PARAMS_NUM_PROFILES] = {
+	{1,   0, 1,  0},
+	{1,   0, 4,  0},
+	{2,   0, 4,  0},
+	{2,   0, 8,  0},
+	{4,   0, 8,  0},
+	{16,  0, 8,  0},
+	{16,  0, 16, 0},
+	{32,  0, 16, 0},
+	{32,  0, 32, 0},
+};
+
+/**
+ * rdma_dim - Runs the adaptive moderation.
+ * @dim: The moderation struct.
+ * @completions: The number of completions collected in this round.
+ *
+ * Each call to rdma_dim takes the latest amount of completions that
+ * have been collected and counts them as a new event.
+ * Once enough events have been collected the algorithm decides a new
+ * moderation level.
+ */
+void rdma_dim(struct dim *dim, u64 completions);
+
 #endif /* DIM_H */
diff --git a/lib/dim/Makefile b/lib/dim/Makefile
index 160afe288df0..1d6858a108cb 100644
--- a/lib/dim/Makefile
+++ b/lib/dim/Makefile
@@ -2,8 +2,6 @@
 # DIM Dynamic Interrupt Moderation library
 #
 
-obj-$(CONFIG_DIMLIB) = net_dim.o
+obj-$(CONFIG_DIMLIB) += dim.o
 
-net_dim-y = \
-	dim.o		\
-	net_dim.o
+dim-y := dim.o net_dim.o rdma_dim.o
diff --git a/lib/dim/rdma_dim.c b/lib/dim/rdma_dim.c
new file mode 100644
index 000000000000..f7e26c7b4749
--- /dev/null
+++ b/lib/dim/rdma_dim.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2019, Mellanox Technologies inc.  All rights reserved.
+ */
+
+#include <linux/dim.h>
+
+static int rdma_dim_step(struct dim *dim)
+{
+	if (dim->tune_state == DIM_GOING_RIGHT) {
+		if (dim->profile_ix == (RDMA_DIM_PARAMS_NUM_PROFILES - 1))
+			return DIM_ON_EDGE;
+		dim->profile_ix++;
+		dim->steps_right++;
+	}
+	if (dim->tune_state == DIM_GOING_LEFT) {
+		if (dim->profile_ix == 0)
+			return DIM_ON_EDGE;
+		dim->profile_ix--;
+		dim->steps_left++;
+	}
+
+	return DIM_STEPPED;
+}
+
+static int rdma_dim_stats_compare(struct dim_stats *curr,
+				  struct dim_stats *prev)
+{
+	/* first stat */
+	if (!prev->cpms)
+		return DIM_STATS_SAME;
+
+	if (IS_SIGNIFICANT_DIFF(curr->cpms, prev->cpms))
+		return (curr->cpms > prev->cpms) ? DIM_STATS_BETTER :
+						DIM_STATS_WORSE;
+
+	if (IS_SIGNIFICANT_DIFF(curr->cpe_ratio, prev->cpe_ratio))
+		return (curr->cpe_ratio > prev->cpe_ratio) ? DIM_STATS_BETTER :
+						DIM_STATS_WORSE;
+
+	return DIM_STATS_SAME;
+}
+
+static bool rdma_dim_decision(struct dim_stats *curr_stats, struct dim *dim)
+{
+	int prev_ix = dim->profile_ix;
+	u8 state = dim->tune_state;
+	int stats_res;
+	int step_res;
+
+	if (state != DIM_PARKING_ON_TOP && state != DIM_PARKING_TIRED) {
+		stats_res = rdma_dim_stats_compare(curr_stats,
+						   &dim->prev_stats);
+
+		switch (stats_res) {
+		case DIM_STATS_SAME:
+			if (curr_stats->cpe_ratio <= 50 * prev_ix)
+				dim->profile_ix = 0;
+			break;
+		case DIM_STATS_WORSE:
+			dim_turn(dim);
+			/* fall through */
+		case DIM_STATS_BETTER:
+			step_res = rdma_dim_step(dim);
+			if (step_res == DIM_ON_EDGE)
+				dim_turn(dim);
+			break;
+		}
+	}
+
+	dim->prev_stats = *curr_stats;
+
+	return dim->profile_ix != prev_ix;
+}
+
+void rdma_dim(struct dim *dim, u64 completions)
+{
+	struct dim_sample *curr_sample = &dim->measuring_sample;
+	struct dim_stats curr_stats;
+	u32 nevents;
+
+	dim_update_sample_with_comps(curr_sample->event_ctr + 1, 0, 0,
+				     curr_sample->comp_ctr + completions,
+				     &dim->measuring_sample);
+
+	switch (dim->state) {
+	case DIM_MEASURE_IN_PROGRESS:
+		nevents = curr_sample->event_ctr - dim->start_sample.event_ctr;
+		if (nevents < DIM_NEVENTS)
+			break;
+		dim_calc_stats(&dim->start_sample, curr_sample, &curr_stats);
+		if (rdma_dim_decision(&curr_stats, dim)) {
+			dim->state = DIM_APPLY_NEW_PROFILE;
+			schedule_work(&dim->work);
+			break;
+		}
+		/* fall through */
+	case DIM_START_MEASURE:
+		dim->state = DIM_MEASURE_IN_PROGRESS;
+		dim_update_sample_with_comps(curr_sample->event_ctr, 0, 0,
+					     curr_sample->comp_ctr,
+					     &dim->start_sample);
+		break;
+	case DIM_APPLY_NEW_PROFILE:
+		break;
+	}
+}
+EXPORT_SYMBOL(rdma_dim);
-- 
2.20.1


^ permalink raw reply related

* [PATCH rdma-next v4 3/3] RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink
From: Leon Romanovsky @ 2019-07-04 12:57 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Max Gurtovoy, Saeed Mahameed,
	Sagi Grimberg, Yamin Friedman
In-Reply-To: <20190704125743.7814-1-leon@kernel.org>

From: Yamin Friedman <yaminf@mellanox.com>

Add a parameter to ib_device for enabling dynamic interrupt moderation so
that it can be configured from userspace using the rdma tool.

To set adaptive-moderation for an IB device, run:
rdma dev set [DEV] adaptive-moderation [on|off]
The value must be either on or off.

rdma dev show
0: mlx5_0: node_type ca fw 16.26.0055 node_guid 248a:0703:00a5:29d0
sys_image_guid 248a:0703:00a5:29d0 adaptive-moderation on

rdma resource show cq
dev mlx5_0 cqn 0 cqe 1023 users 4 poll-ctx UNBOUND_WORKQUEUE
adaptive-moderation off comm [ib_core]
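
As a rough userspace illustration of the on/off handling described above,
the sketch below maps the keyword to the u8 payload and applies the same
range check that ib_device_set_dim() performs in this patch. The names
demo_ib_device, demo_parse_on_off and demo_set_dim are hypothetical and
exist only for this example; this is not the rdma tool's actual parser.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the one-bit use_cq_dim flag in struct ib_device. */
struct demo_ib_device {
	unsigned int use_cq_dim:1;
};

/* Hypothetical helper: map the command-line keyword to the u8 payload. */
static int demo_parse_on_off(const char *arg, unsigned char *val)
{
	if (strcmp(arg, "on") == 0)
		*val = 1;
	else if (strcmp(arg, "off") == 0)
		*val = 0;
	else
		return -EINVAL;
	return 0;
}

/* Same range check as ib_device_set_dim(): only 0 and 1 are accepted. */
static int demo_set_dim(struct demo_ib_device *dev, unsigned char use_dim)
{
	if (use_dim > 1)
		return -EINVAL;
	dev->use_cq_dim = use_dim;
	return 0;
}
```

Rejecting values above 1 matters because the attribute is carried as a
full u8 while the device flag is a single bit.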

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/Kconfig          |  1 +
 drivers/infiniband/core/core_priv.h |  1 +
 drivers/infiniband/core/device.c    |  9 +++++++++
 drivers/infiniband/core/nldev.c     | 14 ++++++++++++++
 include/uapi/rdma/rdma_netlink.h    |  5 +++++
 5 files changed, 30 insertions(+)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index f277cb7aea29..85e103b147cc 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -7,6 +7,7 @@ menuconfig INFINIBAND
 	depends on m || IPV6 != m
 	depends on !ALPHA
 	select IRQ_POLL
+	select DIMLIB
 	---help---
 	  Core support for InfiniBand (IB).  Make sure to also select
 	  any protocols you wish to use as well as drivers for your
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a953c2fa2e78..888d89ce81df 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -60,6 +60,7 @@ extern bool ib_devices_shared_netns;
 int ib_device_register_sysfs(struct ib_device *device);
 void ib_device_unregister_sysfs(struct ib_device *device);
 int ib_device_rename(struct ib_device *ibdev, const char *name);
+int ib_device_set_dim(struct ib_device *ibdev, u8 use_dim);
 
 typedef void (*roce_netdev_callback)(struct ib_device *device, u8 port,
 	      struct net_device *idev, void *cookie);
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index bdf61499e6d5..7f4affe8a10d 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -448,6 +448,15 @@ int ib_device_rename(struct ib_device *ibdev, const char *name)
 	return 0;
 }
 
+int ib_device_set_dim(struct ib_device *ibdev, u8 use_dim)
+{
+	if (use_dim > 1)
+		return -EINVAL;
+	ibdev->use_cq_dim = use_dim;
+
+	return 0;
+}
+
 static int alloc_name(struct ib_device *ibdev, const char *name)
 {
 	struct ib_device *device;
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index a4431ed566b6..d9f2a30e6467 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -52,6 +52,7 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 					.len = RDMA_NLDEV_ATTR_EMPTY_STRING },
 	[RDMA_NLDEV_ATTR_CHARDEV_TYPE]		= { .type = NLA_NUL_STRING,
 					.len = RDMA_NLDEV_ATTR_CHARDEV_TYPE_SIZE },
+	[RDMA_NLDEV_ATTR_DEV_DIM]               = { .type = NLA_U8 },
 	[RDMA_NLDEV_ATTR_DEV_INDEX]		= { .type = NLA_U32 },
 	[RDMA_NLDEV_ATTR_DEV_NAME]		= { .type = NLA_NUL_STRING,
 					.len = IB_DEVICE_NAME_MAX },
@@ -252,6 +253,8 @@ static int fill_dev_info(struct sk_buff *msg, struct ib_device *device)
 		return -EMSGSIZE;
 	if (nla_put_u8(msg, RDMA_NLDEV_ATTR_DEV_NODE_TYPE, device->node_type))
 		return -EMSGSIZE;
+	if (nla_put_u8(msg, RDMA_NLDEV_ATTR_DEV_DIM, device->use_cq_dim))
+		return -EMSGSIZE;
 
 	/*
 	 * Link type is determined on first port and mlx4 device
@@ -552,6 +555,9 @@ static int fill_res_cq_entry(struct sk_buff *msg, bool has_cap_net_admin,
 	    nla_put_u8(msg, RDMA_NLDEV_ATTR_RES_POLL_CTX, cq->poll_ctx))
 		goto err;
 
+	if (nla_put_u8(msg, RDMA_NLDEV_ATTR_DEV_DIM, (cq->dim != NULL)))
+		goto err;
+
 	if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_CQN, res->id))
 		goto err;
 	if (!rdma_is_kernel_res(res) &&
@@ -870,6 +876,14 @@ static int nldev_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 		goto put_done;
 	}
 
+	if (tb[RDMA_NLDEV_ATTR_DEV_DIM]) {
+		u8 use_dim;
+
+		use_dim = nla_get_u8(tb[RDMA_NLDEV_ATTR_DEV_DIM]);
+		err = ib_device_set_dim(device,  use_dim);
+		goto done;
+	}
+
 done:
 	ib_device_put(device);
 put_done:
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 8c5383e28438..f6d54c0a0ae8 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -522,6 +522,11 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_NAME,	/* string */
 	RDMA_NLDEV_ATTR_STAT_HWCOUNTER_ENTRY_VALUE,	/* u64 */
 
+	/*
+	 * CQ adaptive moderation (DIM)
+	 */
+	RDMA_NLDEV_ATTR_DEV_DIM,                /* u8 */
+
 	/*
 	 * Always the end
 	 */
-- 
2.20.1


^ permalink raw reply related

* [PATCH rdma-next v4 2/3] RDMA/core: Provide RDMA DIM support for ULPs
From: Leon Romanovsky @ 2019-07-04 12:57 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Max Gurtovoy, Saeed Mahameed,
	Sagi Grimberg, Yamin Friedman
In-Reply-To: <20190704125743.7814-1-leon@kernel.org>

From: Yamin Friedman <yaminf@mellanox.com>

Added the interface in the infiniband driver that applies the rdma_dim
adaptive moderation. There is now a special function for allocating an
ib_cq that uses rdma_dim.

Performance improvement (ConnectX-5 100GbE, x86) running FIO benchmark over
NVMf between two equal end-hosts with 56 cores across a Mellanox switch
using null_blk device:

IO READS without DIM:
blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
512B     | 3.8GiB/s  | 7.7M  | 1401  usec              | 2442  usec
4k       | 7.0GiB/s  | 1.8M  | 4817  usec              | 6587  usec
64k      | 10.7GiB/s | 175k  | 9896  usec              | 10028 usec

IO WRITES without DIM:
blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
512B     | 3.6GiB/s  | 7.5M  | 1434  usec              | 2474  usec
4k       | 6.3GiB/s  | 1.6M  | 938   usec              | 1221  usec
64k      | 10.7GiB/s | 175k  | 8979  usec              | 12780 usec

IO READS with DIM:
blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
512B     | 4GiB/s    | 8.2M  | 816   usec              | 889   usec
4k       | 10.1GiB/s | 2.65M | 3359  usec              | 5080  usec
64k      | 10.7GiB/s | 175k  | 9896  usec              | 10028 usec

IO WRITES with DIM:
blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
512B     | 3.9GiB/s  | 8.1M  | 799   usec              | 922   usec
4k       | 9.6GiB/s  | 2.5M  | 717   usec              | 1004  usec
64k      | 10.7GiB/s | 176k  | 8586  usec              | 12256 usec
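
For reading the tables, the relative change between a "without DIM" and
a "with DIM" figure can be computed as below. This is only a small helper
sketch; pct_change is a name made up for this example and is not part of
the patch.

```c
/* Relative change from 'before' to 'after', expressed in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, 4k read IOPS go from 1.8M to 2.65M, which is roughly +47%,
and the 512B read 99th percentile latency goes from 1401 usec to 816 usec,
which is roughly -42%.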

The rdma_dim algorithm was designed to measure the effectiveness of
moderation on the flow in a general way and thus should be appropriate
for all RDMA storage protocols.
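
The stepping part of the algorithm can be sketched in plain userspace C
as follows. This is a simplified model of rdma_dim_step() and
rdma_dim_decision() from lib/dim/rdma_dim.c in this series, not the
kernel code itself: the demo_* names and the profile count of 9 are
assumptions for illustration, the parking states are left out, and the
"same stats" case omits the cpe_ratio check that can reset the profile
to 0.

```c
enum demo_tune { DEMO_GOING_RIGHT, DEMO_GOING_LEFT };
enum demo_step_res { DEMO_STEPPED, DEMO_ON_EDGE };

#define DEMO_NUM_PROFILES 9	/* assumed table size, illustration only */

struct demo_dim {
	enum demo_tune tune_state;
	int profile_ix;
};

/* Move one profile in the current direction unless already at an edge. */
static enum demo_step_res demo_step(struct demo_dim *dim)
{
	if (dim->tune_state == DEMO_GOING_RIGHT) {
		if (dim->profile_ix == DEMO_NUM_PROFILES - 1)
			return DEMO_ON_EDGE;
		dim->profile_ix++;
	} else {
		if (dim->profile_ix == 0)
			return DEMO_ON_EDGE;
		dim->profile_ix--;
	}
	return DEMO_STEPPED;
}

/* Reverse the search direction. */
static void demo_turn(struct demo_dim *dim)
{
	dim->tune_state = (dim->tune_state == DEMO_GOING_RIGHT) ?
			  DEMO_GOING_LEFT : DEMO_GOING_RIGHT;
}

/*
 * stats_res: > 0 means the new sample is better, < 0 worse, 0 the same.
 * Returns 1 when the profile index changed and a new profile would
 * have to be applied to the CQ, 0 otherwise.
 */
static int demo_decide(struct demo_dim *dim, int stats_res)
{
	int prev_ix = dim->profile_ix;

	if (stats_res < 0)
		demo_turn(dim);		/* worse: go back the other way */
	if (stats_res != 0 && demo_step(dim) == DEMO_ON_EDGE)
		demo_turn(dim);		/* hit an edge: turn around */

	return dim->profile_ix != prev_ix;
}
```

In this model a "better" sample keeps moving in the same direction, a
"worse" sample reverses first and then steps, and hitting either end of
the profile table only flips the direction for the next decision.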

rdma_dim is configured to be the default option based on performance
improvement seen after extensive tests.

Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/cq.c      | 45 +++++++++++++++++++++++++++++++
 drivers/infiniband/hw/mlx5/main.c |  2 ++
 include/rdma/ib_verbs.h           |  4 +++
 3 files changed, 51 insertions(+)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 00d70f166209..ffd6e24109d5 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -18,6 +18,40 @@
 #define IB_POLL_FLAGS \
 	(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)
 
+static void ib_cq_rdma_dim_work(struct work_struct *w)
+{
+	struct dim *dim = container_of(w, struct dim, work);
+	struct ib_cq *cq = dim->priv;
+
+	u16 usec = rdma_dim_prof[dim->profile_ix].usec;
+	u16 comps = rdma_dim_prof[dim->profile_ix].comps;
+
+	dim->state = DIM_START_MEASURE;
+
+	cq->device->ops.modify_cq(cq, comps, usec);
+}
+
+static void rdma_dim_init(struct ib_cq *cq)
+{
+	struct dim *dim;
+
+	if (!cq->device->ops.modify_cq || !cq->device->use_cq_dim ||
+	    cq->poll_ctx == IB_POLL_DIRECT)
+		return;
+
+	dim = kzalloc(sizeof(struct dim), GFP_KERNEL);
+	if (!dim)
+		return;
+
+	dim->state = DIM_START_MEASURE;
+	dim->tune_state = DIM_GOING_RIGHT;
+	dim->profile_ix = RDMA_DIM_START_PROFILE;
+	dim->priv = cq;
+	cq->dim = dim;
+
+	INIT_WORK(&dim->work, ib_cq_rdma_dim_work);
+}
+
 static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *wcs,
 			   int batch)
 {
@@ -78,6 +112,7 @@ static void ib_cq_completion_direct(struct ib_cq *cq, void *private)
 static int ib_poll_handler(struct irq_poll *iop, int budget)
 {
 	struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
+	struct dim *dim = cq->dim;
 	int completed;
 
 	completed = __ib_process_cq(cq, budget, cq->wc, IB_POLL_BATCH);
@@ -87,6 +122,9 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)
 			irq_poll_sched(&cq->iop);
 	}
 
+	if (dim)
+		rdma_dim(dim, completed);
+
 	return completed;
 }
 
@@ -105,6 +143,8 @@ static void ib_cq_poll_work(struct work_struct *work)
 	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
 	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
 		queue_work(cq->comp_wq, &cq->work);
+	else if (cq->dim)
+		rdma_dim(cq->dim, completed);
 }
 
 static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
@@ -161,6 +201,8 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
 
 	rdma_restrack_kadd(&cq->res);
 
+	rdma_dim_init(cq);
+
 	switch (cq->poll_ctx) {
 	case IB_POLL_DIRECT:
 		cq->comp_handler = ib_cq_completion_direct;
@@ -223,6 +265,9 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 
 	rdma_restrack_del(&cq->res);
 	cq->device->ops.destroy_cq(cq, udata);
+	if (cq->dim)
+		cancel_work_sync(&cq->dim->work);
+	kfree(cq->dim);
 	kfree(cq->wc);
 	kfree(cq);
 }
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 7581571bd9cd..07a05b0b9e42 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6424,6 +6424,8 @@ static int mlx5_ib_stage_caps_init(struct mlx5_ib_dev *dev)
 	     MLX5_CAP_GEN(dev->mdev, disable_local_lb_mc)))
 		mutex_init(&dev->lb.mutex);
 
+	dev->ib_dev.use_cq_dim = true;
+
 	return 0;
 }
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 50806bef9f20..30eb68f36109 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -61,6 +61,7 @@
 #include <linux/cgroup_rdma.h>
 #include <linux/irqflags.h>
 #include <linux/preempt.h>
+#include <linux/dim.h>
 #include <uapi/rdma/ib_user_verbs.h>
 #include <rdma/rdma_counter.h>
 #include <rdma/restrack.h>
@@ -1509,6 +1510,7 @@ struct ib_cq {
 		struct work_struct	work;
 	};
 	struct workqueue_struct *comp_wq;
+	struct dim *dim;
 	/*
 	 * Implementation details of the RDMA core, don't use in drivers:
 	 */
@@ -2576,6 +2578,8 @@ struct ib_device {
 	u16                          is_switch:1;
 	/* Indicates kernel verbs support, should not be used in drivers */
 	u16                          kverbs_provider:1;
+	/* CQ adaptive moderation (RDMA DIM) */
+	u16                          use_cq_dim:1;
 	u8                           node_type;
 	u8                           phys_port_cnt;
 	struct ib_device_attr        attrs;
-- 
2.20.1


^ permalink raw reply related

* [PATCH rdma-next v4 0/3] Use RDMA adaptive moderation library
From: Leon Romanovsky @ 2019-07-04 12:57 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Max Gurtovoy, Saeed Mahameed,
	Sagi Grimberg, Yamin Friedman

From: Leon Romanovsky <leonro@mellanox.com>

Hi,

This is the RDMA part of the previously sent DIM library improvements
series [1], which was pulled by Dave. It needs to be pulled into RDMA too
as a prerequisite.

Changes since v3:
- Renamed dim_owner to priv
- Added Sagi's Reviewed-by tags
- Removed casting of void pointer

Changes since v2:
- renamed user-space knob from dim to adaptive-moderation (Sagi)
- some minor code clean ups (Sagi)
- Reordered patches to ensure that netlink expose is last in the series.
- Slightly cleaned commit messages
- Changed "bool use_cq_dim" flag to be bitwise to save space.

Changes since v1:
- added per ib device configuration knob for rdma-dim (Sagi)
- add NL directives for user-space / rdma tool to configure rdma dim
  (Sagi/Leon)
- use one header file for DIM implementations (Leon)
- various point changes in the rdma dim related code in the IB core
  (Leon)
- removed the RDMA specific patches from this pull request

Thanks

[1] https://www.spinics.net/lists/netdev/msg581046.html

Yamin Friedman (3):
  linux/dim: Implement RDMA adaptive moderation (DIM)
  RDMA/core: Provide RDMA DIM support for ULPs
  RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation
    to netlink

 drivers/infiniband/Kconfig          |   1 +
 drivers/infiniband/core/core_priv.h |   1 +
 drivers/infiniband/core/cq.c        |  45 ++++++++++++
 drivers/infiniband/core/device.c    |   9 +++
 drivers/infiniband/core/nldev.c     |  14 ++++
 drivers/infiniband/hw/mlx5/main.c   |   2 +
 include/linux/dim.h                 |  36 ++++++++++
 include/rdma/ib_verbs.h             |   4 ++
 include/uapi/rdma/rdma_netlink.h    |   5 ++
 lib/dim/Makefile                    |   6 +-
 lib/dim/rdma_dim.c                  | 108 ++++++++++++++++++++++++++++
 11 files changed, 227 insertions(+), 4 deletions(-)
 create mode 100644 lib/dim/rdma_dim.c

--
2.20.1


^ permalink raw reply

* [PATCH] RDMA/uverbs: remove redundant assignment to variable ret
From: Colin King @ 2019-07-04 12:50 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe, linux-rdma; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

The variable ret is being initialized with a value that is never
read and it is being updated later with a new value. The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/infiniband/core/uverbs_cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 750c4d484329..7ddd0e5bc6b3 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -2548,7 +2548,7 @@ static int ib_uverbs_detach_mcast(struct uverbs_attr_bundle *attrs)
 	struct ib_uqp_object         *obj;
 	struct ib_qp                 *qp;
 	struct ib_uverbs_mcast_entry *mcast;
-	int                           ret = -EINVAL;
+	int                           ret;
 	bool                          found = false;
 
 	ret = uverbs_request(attrs, &cmd, sizeof(cmd));
-- 
2.20.1


^ permalink raw reply related

* Re: [net-next 1/3] ice: Initialize and register platform device to provide RDMA
From: Jason Gunthorpe @ 2019-07-04 12:48 UTC (permalink / raw)
  To: Greg KH
  Cc: Jeff Kirsher, davem@davemloft.net, dledford@redhat.com,
	Tony Nguyen, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, poswald@suse.com,
	mustafa.ismail@intel.com, shiraz.saleem@intel.com, Dave Ertman,
	Andrew Bowers
In-Reply-To: <20190704124247.GA6807@kroah.com>

On Thu, Jul 04, 2019 at 02:42:47PM +0200, Greg KH wrote:
> On Thu, Jul 04, 2019 at 12:37:33PM +0000, Jason Gunthorpe wrote:
> > On Thu, Jul 04, 2019 at 02:29:50PM +0200, Greg KH wrote:
> > > On Thu, Jul 04, 2019 at 12:16:41PM +0000, Jason Gunthorpe wrote:
> > > > On Wed, Jul 03, 2019 at 07:12:50PM -0700, Jeff Kirsher wrote:
> > > > > From: Tony Nguyen <anthony.l.nguyen@intel.com>
> > > > > 
> > > > > The RDMA block does not advertise on the PCI bus or any other bus.
> > > > > Thus the ice driver needs to provide access to the RDMA hardware block
> > > > > via a virtual bus; utilize the platform bus to provide this access.
> > > > > 
> > > > > This patch initializes the driver to support RDMA as well as creates
> > > > > and registers a platform device for the RDMA driver to register to. At
> > > > > this point the driver is fully initialized to register a platform
> > > > > driver, however, can not yet register as the ops have not been
> > > > > implemented.
> > > > 
> > > > I think you need Greg's ack on all this driver stuff - particularly
> > > > that a platform_device is OK.
> > > 
> > > A platform_device is almost NEVER ok.
> > > 
> > > Don't abuse it, make a real device on a real bus.  If you don't have a
> > > real bus and just need to create a device to hang other things off of,
> > > then use the virtual one, that's what it is there for.
> > 
> > Ideally I'd like to see all the RDMA drivers that connect to ethernet
> > drivers use some similar scheme.
> 
> Why?  They should be attached to a "real" device, why make any up?

? A "real" device, like struct pci_device, can only bind to one
driver. How can we bind it concurrently to net, rdma, scsi, etc?

> > This is for a PCI device that plugs into multiple subsystems in the
> > kernel, ie it has net driver functionality, rdma functionality, some
> > even have SCSI functionality
> 
> Sounds like a MFD device, why aren't you using that functionality
> instead?

This was also my advice, but in another email Jeff says:

  MFD architecture was also considered, and we selected the simpler
  platform model. Supporting a MFD architecture would require an
  additional MFD core driver, individual platform netdev, RDMA function
  drivers, and stripping a large portion of the netdev drivers into
  MFD core. The sub-devices registered by MFD core for function
  drivers are indeed platform devices.  

Thanks,
Jason

^ permalink raw reply

* Re: [net-next 1/3] ice: Initialize and register platform device to provide RDMA
From: Greg KH @ 2019-07-04 12:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jeff Kirsher, davem@davemloft.net, dledford@redhat.com,
	Tony Nguyen, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, poswald@suse.com,
	mustafa.ismail@intel.com, shiraz.saleem@intel.com, Dave Ertman,
	Andrew Bowers
In-Reply-To: <20190704123729.GF3401@mellanox.com>

On Thu, Jul 04, 2019 at 12:37:33PM +0000, Jason Gunthorpe wrote:
> On Thu, Jul 04, 2019 at 02:29:50PM +0200, Greg KH wrote:
> > On Thu, Jul 04, 2019 at 12:16:41PM +0000, Jason Gunthorpe wrote:
> > > On Wed, Jul 03, 2019 at 07:12:50PM -0700, Jeff Kirsher wrote:
> > > > From: Tony Nguyen <anthony.l.nguyen@intel.com>
> > > > 
> > > > The RDMA block does not advertise on the PCI bus or any other bus.
> > > > Thus the ice driver needs to provide access to the RDMA hardware block
> > > > via a virtual bus; utilize the platform bus to provide this access.
> > > > 
> > > > This patch initializes the driver to support RDMA as well as creates
> > > > and registers a platform device for the RDMA driver to register to. At
> > > > this point the driver is fully initialized to register a platform
> > > > driver, however, can not yet register as the ops have not been
> > > > implemented.
> > > 
> > > I think you need Greg's ack on all this driver stuff - particularly
> > > that a platform_device is OK.
> > 
> > A platform_device is almost NEVER ok.
> > 
> > Don't abuse it, make a real device on a real bus.  If you don't have a
> > real bus and just need to create a device to hang other things off of,
> > then use the virtual one, that's what it is there for.
> 
> Ideally I'd like to see all the RDMA drivers that connect to ethernet
> drivers use some similar scheme.

Why?  They should be attached to a "real" device, why make any up?

> Should it be some generic virtual bus?

There is a generic virtual bus today.

> This is for a PCI device that plugs into multiple subsystems in the
> kernel, ie it has net driver functionality, rdma functionality, some
> even have SCSI functionality

Sounds like a MFD device, why aren't you using that functionality
instead?

thanks,

greg k-h

^ permalink raw reply

* Re: [net-next 1/3] ice: Initialize and register platform device to provide RDMA
From: Jason Gunthorpe @ 2019-07-04 12:37 UTC (permalink / raw)
  To: Greg KH
  Cc: Jeff Kirsher, davem@davemloft.net, dledford@redhat.com,
	Tony Nguyen, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, poswald@suse.com,
	mustafa.ismail@intel.com, shiraz.saleem@intel.com, Dave Ertman,
	Andrew Bowers
In-Reply-To: <20190704122950.GA6007@kroah.com>

On Thu, Jul 04, 2019 at 02:29:50PM +0200, Greg KH wrote:
> On Thu, Jul 04, 2019 at 12:16:41PM +0000, Jason Gunthorpe wrote:
> > On Wed, Jul 03, 2019 at 07:12:50PM -0700, Jeff Kirsher wrote:
> > > From: Tony Nguyen <anthony.l.nguyen@intel.com>
> > > 
> > > The RDMA block does not advertise on the PCI bus or any other bus.
> > > Thus the ice driver needs to provide access to the RDMA hardware block
> > > via a virtual bus; utilize the platform bus to provide this access.
> > > 
> > > This patch initializes the driver to support RDMA as well as creates
> > > and registers a platform device for the RDMA driver to register to. At
> > > this point the driver is fully initialized to register a platform
> > > driver, however, can not yet register as the ops have not been
> > > implemented.
> > 
> > I think you need Greg's ack on all this driver stuff - particularly
> > that a platform_device is OK.
> 
> A platform_device is almost NEVER ok.
> 
> Don't abuse it, make a real device on a real bus.  If you don't have a
> real bus and just need to create a device to hang other things off of,
> then use the virtual one, that's what it is there for.

Ideally I'd like to see all the RDMA drivers that connect to ethernet
drivers use some similar scheme.

Should it be some generic virtual bus?

This is for a PCI device that plugs into multiple subsystems in the
kernel, ie it has net driver functionality, rdma functionality, some
even have SCSI functionality

Jason

^ permalink raw reply

* Re: [RFC rdma 1/3] RDMA/core: Create a common mmap function
From: Jason Gunthorpe @ 2019-07-04 12:35 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Michal Kalderon, dledford@redhat.com, leon@kernel.org,
	sleybo@amazon.com, Ariel Elior, linux-rdma@vger.kernel.org
In-Reply-To: <85247f12-1d78-0e66-fadc-d04862511ca7@amazon.com>

On Wed, Jul 03, 2019 at 11:19:34AM +0300, Gal Pressman wrote:
> On 03/07/2019 1:31, Jason Gunthorpe wrote:
> >> Seems except Mellanox + hns the mmap flags aren't ABI. 
> >> Also, current Mellanox code seems like it won't benefit from 
> >> mmap cookie helper functions in any case as the mmap function is very specific and the flags used indicate 
> >> the address and not just how to map it.
> > 
> > IMHO, mlx5 has a goofy implementation here as it codes all of the object
> > type, handle and cachability flags in one thing.
> 
> Do we need object type flags as well in the generic mmap code?

At the end of the day the driver needs to know what page to map during
the mmap syscall.

mlx5 does this by encoding the page type in the address, and then many
types have separate lookups based on the offset for the actual page.

IMHO the single lookup and opaque offset is generally better..

Since the mlx5 scheme is ABI it can't be changed unfortunately.

If you want to do user controlled cachability flags, or not, is a fair
question, but they still become ABI..

I'm wondering if it really makes sense to do that during the mmap, or
if the cachability should be set as part of creating the cookie?
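
To make the two schemes concrete, here is a rough userspace sketch. The
bit layout, the table size and every demo_* name are assumptions for
illustration only; this is neither the mlx5 ABI nor a proposed core API.
The first part packs a page type and an index into the offset, the second
hands out an opaque cookie and records the cachability when the entry is
created.

```c
#include <stddef.h>
#include <stdint.h>

/* Scheme 1 (hypothetical layout): type in the high bits, index in the low bits. */
#define DEMO_TYPE_SHIFT 8
#define DEMO_INDEX_MASK ((1u << DEMO_TYPE_SHIFT) - 1)

static uint32_t demo_encode(uint32_t type, uint32_t index)
{
	return (type << DEMO_TYPE_SHIFT) | (index & DEMO_INDEX_MASK);
}

static uint32_t demo_type(uint32_t pgoff)
{
	return pgoff >> DEMO_TYPE_SHIFT;
}

static uint32_t demo_index(uint32_t pgoff)
{
	return pgoff & DEMO_INDEX_MASK;
}

/* Scheme 2: the offset is an opaque cookie resolved by a single lookup. */
#define DEMO_MAX_ENTRIES 16

struct demo_mmap_entry {
	uint64_t cookie;
	uint64_t phys_addr;
	int cacheable;		/* fixed when the cookie is created */
	int in_use;
};

struct demo_mmap_table {
	struct demo_mmap_entry e[DEMO_MAX_ENTRIES];
	uint64_t next_cookie;
};

/* Returns a non-zero cookie, or 0 when the table is full. */
static uint64_t demo_cookie_add(struct demo_mmap_table *t, uint64_t phys_addr,
				int cacheable)
{
	size_t i;

	for (i = 0; i < DEMO_MAX_ENTRIES; i++) {
		if (!t->e[i].in_use) {
			t->e[i].cookie = ++t->next_cookie;
			t->e[i].phys_addr = phys_addr;
			t->e[i].cacheable = cacheable;
			t->e[i].in_use = 1;
			return t->e[i].cookie;
		}
	}
	return 0;
}

/* Single lookup at mmap time: the cookie alone identifies the page. */
static const struct demo_mmap_entry *
demo_cookie_lookup(const struct demo_mmap_table *t, uint64_t cookie)
{
	size_t i;

	for (i = 0; i < DEMO_MAX_ENTRIES; i++)
		if (t->e[i].in_use && t->e[i].cookie == cookie)
			return &t->e[i];
	return NULL;
}
```

With the cookie scheme userspace never interprets the offset, so nothing
about its layout needs to become ABI.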

> Another issue is that these flags aren't exposed in an ABI file, so
> a userspace library can't really make use of it in current state.

Woops.

Ah, this is all ABI so you need to dig out of this hole ASAP :)

Jason

^ permalink raw reply
