Linux RDMA and InfiniBand development

Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Leon Romanovsky @ 2019-07-09 13:32 UTC (permalink / raw)
  To: Greg KH
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, Sagi Grimberg, bvanassche, jgg, dledford,
	Roman Pen
In-Reply-To: <20190709111737.GB6719@kroah.com>

On Tue, Jul 09, 2019 at 01:17:37PM +0200, Greg KH wrote:
> On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > >
> > > Could you please provide some feedback to the IBNBD driver and the
> > > IBTRS library?
> > > So far we addressed all the requests provided by the community and
> > > continue to maintain our code up-to-date with the upstream kernel
> > > while having an extra compatibility layer for older kernels in our
> > > out-of-tree repository.
> > > I understand that SRP and NVMEoF which are in the kernel already do
> > > provide equivalent functionality for the majority of the use cases.
> > > IBNBD on the other hand is showing higher performance and more
> > > importantly includes the IBTRS - a general purpose library to
> > > establish connections and transport BIO-like read/write sg-lists over
> > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > it would make sense for us to rework our patchset and try pushing it
> > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > beneficial for the eco-system, find a proper location for it within
> > > block/rdma subsystems? This would make it easier for people to try it
> > > out and would also be a huge step for us in terms of maintenance
> > > effort.
> > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > near future). Do you think it would make sense to rename the driver to
> > > RNBD/RTRS?
> >
> > It is better to avoid "staging" tree, because it will lack attention of
> > relevant people and your efforts will be lost once you will try to move
> > out of staging. We are all remembering Lustre and don't want to see it
> > again.
>
> That's up to the developers, that had nothing to do with the fact that
> the code was in the staging tree.  If the Lustre developers had actually
> done the requested work, it would have moved out of the staging tree.
>
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.

Greg,

It is not matter of how much *real* work developers will do, but
it is a matter of guidance to do *right* thing, which is hard to achieve
if people mentioned in the beginning of this thread wouldn't look on
staging code.

>
> But I will note that it is almost always easier to just do the work
> ahead of time, and merge it in "correctly" than to go from staging into
> the real part of the kernel.  But it's up to the developers what they
> want to do.
>
> thanks,
>
> greg k-h

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jason Gunthorpe @ 2019-07-09 13:19 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jinpu Wang, Leon Romanovsky, Danil Kipnis,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, bvanassche@acm.org,
	dledford@redhat.com, Roman Pen, Greg Kroah-Hartman
In-Reply-To: <CAMGffE=T+FVfVzV5cCtVrm_6ikdJ9pjpFsPgx+t0EUpegoZELQ@mail.gmail.com>

On Tue, Jul 09, 2019 at 03:15:46PM +0200, Jinpu Wang wrote:
> On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道：
> > > >
> > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > > > >
> > > > > Could you please provide some feedback to the IBNBD driver and the
> > > > > IBTRS library?
> > > > > So far we addressed all the requests provided by the community and
> > > > > continue to maintain our code up-to-date with the upstream kernel
> > > > > while having an extra compatibility layer for older kernels in our
> > > > > out-of-tree repository.
> > > > > I understand that SRP and NVMEoF which are in the kernel already do
> > > > > provide equivalent functionality for the majority of the use cases.
> > > > > IBNBD on the other hand is showing higher performance and more
> > > > > importantly includes the IBTRS - a general purpose library to
> > > > > establish connections and transport BIO-like read/write sg-lists over
> > > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > > > it would make sense for us to rework our patchset and try pushing it
> > > > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > > > beneficial for the eco-system, find a proper location for it within
> > > > > block/rdma subsystems? This would make it easier for people to try it
> > > > > out and would also be a huge step for us in terms of maintenance
> > > > > effort.
> > > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > > > near future). Do you think it would make sense to rename the driver to
> > > > > RNBD/RTRS?
> > > >
> > > > It is better to avoid "staging" tree, because it will lack attention of
> > > > relevant people and your efforts will be lost once you will try to move
> > > > out of staging. We are all remembering Lustre and don't want to see it
> > > > again.
> > > >
> > > > Back then, you was asked to provide support for performance superiority.
> > > > Can you please share any numbers with us?
> > > Hi Leon,
> > >
> > > Thanks for you feedback.
> > >
> > > For performance numbers,  Danil did intensive benchmark, and create
> > > some PDF with graphes here:
> > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> > >
> > > It includes both single path results also different multipath policy results.
> > >
> > > If you have any question regarding the results, please let us know.
> >
> > I kind of recall that last time the perf numbers were skewed toward
> > IBNBD because the invalidation model for MR was wrong - did this get
> > fixed?
> >
> > Jason
> 
> Thanks Jason for feedback.
> Can you be  more specific about  "the invalidation model for MR was wrong"

MR's must be invalidated before data is handed over to the block
layer. It can't leave MRs open for access and then touch the memory
the MR covers.

IMHO this is the most likely explanation for any performance difference
from nvme..

> I checked in the history of the email thread, only found
> "I think from the RDMA side, before we accept something like this, I'd
> like to hear from Christoph, Chuck or Sagi that the dataplane
> implementation of this is correct, eg it uses the MRs properly and
> invalidates at the right time, sequences with dma_ops as required,
> etc.
> "
> And no reply from any of you since then.

This task still needs to happen..

Jason

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jinpu Wang @ 2019-07-09 13:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jinpu Wang, Leon Romanovsky, Danil Kipnis,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, bvanassche@acm.org,
	dledford@redhat.com, Roman Pen, Greg Kroah-Hartman
In-Reply-To: <20190709120606.GB3436@mellanox.com>

On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道：
> > >
> > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > > >
> > > > Could you please provide some feedback to the IBNBD driver and the
> > > > IBTRS library?
> > > > So far we addressed all the requests provided by the community and
> > > > continue to maintain our code up-to-date with the upstream kernel
> > > > while having an extra compatibility layer for older kernels in our
> > > > out-of-tree repository.
> > > > I understand that SRP and NVMEoF which are in the kernel already do
> > > > provide equivalent functionality for the majority of the use cases.
> > > > IBNBD on the other hand is showing higher performance and more
> > > > importantly includes the IBTRS - a general purpose library to
> > > > establish connections and transport BIO-like read/write sg-lists over
> > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > > it would make sense for us to rework our patchset and try pushing it
> > > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > > beneficial for the eco-system, find a proper location for it within
> > > > block/rdma subsystems? This would make it easier for people to try it
> > > > out and would also be a huge step for us in terms of maintenance
> > > > effort.
> > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > > near future). Do you think it would make sense to rename the driver to
> > > > RNBD/RTRS?
> > >
> > > It is better to avoid "staging" tree, because it will lack attention of
> > > relevant people and your efforts will be lost once you will try to move
> > > out of staging. We are all remembering Lustre and don't want to see it
> > > again.
> > >
> > > Back then, you was asked to provide support for performance superiority.
> > > Can you please share any numbers with us?
> > Hi Leon,
> >
> > Thanks for you feedback.
> >
> > For performance numbers,  Danil did intensive benchmark, and create
> > some PDF with graphes here:
> > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> >
> > It includes both single path results also different multipath policy results.
> >
> > If you have any question regarding the results, please let us know.
>
> I kind of recall that last time the perf numbers were skewed toward
> IBNBD because the invalidation model for MR was wrong - did this get
> fixed?
>
> Jason

Thanks Jason for feedback.
Can you be  more specific about  "the invalidation model for MR was wrong"

I checked in the history of the email thread, only found
"I think from the RDMA side, before we accept something like this, I'd
like to hear from Christoph, Chuck or Sagi that the dataplane
implementation of this is correct, eg it uses the MRs properly and
invalidates at the right time, sequences with dma_ops as required,
etc.
"
And no reply from any of you since then.

Thanks,
Jack

^ permalink raw reply

* Re: linux-next: build failure after merge of the rdma tree
From: Jason Gunthorpe @ 2019-07-09 12:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Stephen Rothwell, Doug Ledford, Mark Zhang,
	Linux Next Mailing List, Linux Kernel Mailing List, Majd Dibbiny,
	asahiro Yamada, linux-rdma@vger.kernel.org
In-Reply-To: <20190709071758.GI7034@mtr-leonro.mtl.com>

On Tue, Jul 09, 2019 at 04:18:00AM -0300, Leon Romanovsky wrote:
> On Tue, Jul 09, 2019 at 10:04:16AM +0300, Mark Zhang wrote:
> > Hi Stephen,
> 
> Stephen,
> 
> For some reason, I wasn't in initial email report, can you please check why?
> 
> I need to be aware of any issues related to patches with my name on it
> for tracking and improving internal submission flows/checks.
> 
> >
> > Can you please try the patch below, thank you.
> 
> Jason, Doug,
> 
> Can you please take this patch?

It isn't quite enough to make the header compile stand alone, I'm
adding this instead.

From 37c1e072276b03b080eb24ff24c39080aeaf49ef Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Tue, 9 Jul 2019 09:44:47 -0300
Subject: [PATCH] RDMA/counters: Make rdma_counter.h compile stand alone

5.4-rc1 will have new compile time debugging to test that headers can be
compiled stand alone. Many rdma headers are already broken and excluded
from the mechanism, however to avoid compile failures during the merge
window fix enough so that the newly added header compiles clean.

Fixes: 413d3347503b ("RDMA/counter: Add set/clear per-port auto mode support")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Mark Zhang <markz@mellanox.com>
---
 include/rdma/rdma_counter.h | 2 +-
 include/rdma/restrack.h     | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/rdma/rdma_counter.h b/include/rdma/rdma_counter.h
index 68827700ba957e..eb99856e8b3078 100644
--- a/include/rdma/rdma_counter.h
+++ b/include/rdma/rdma_counter.h
@@ -9,10 +9,10 @@
 #include <linux/mutex.h>
 #include <linux/pid_namespace.h>
 
-#include <rdma/ib_verbs.h>
 #include <rdma/restrack.h>
 #include <rdma/rdma_netlink.h>
 
+struct ib_device;
 struct ib_qp;
 
 struct auto_mode_param {
diff --git a/include/rdma/restrack.h b/include/rdma/restrack.h
index 4041a4d96524b4..b0fc6b26bdf531 100644
--- a/include/rdma/restrack.h
+++ b/include/rdma/restrack.h
@@ -14,6 +14,9 @@
 #include <uapi/rdma/rdma_netlink.h>
 #include <linux/xarray.h>
 
+struct ib_device;
+struct sk_buff;
+
 /**
  * enum rdma_restrack_type - HW objects to track
  */
@@ -52,8 +55,6 @@ enum rdma_restrack_type {
 	RDMA_RESTRACK_MAX
 };
 
-struct ib_device;
-
 /**
  * struct rdma_restrack_entry - metadata per-entry
  */
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v2] RDMA/core: Fix race when resolving IP address
From: Jason Gunthorpe @ 2019-07-09 12:26 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Parav Pandit, Dag Moxnes, Doug Ledford, Parav Pandit, linux-rdma,
	Linux Kernel Mailing List
In-Reply-To: <20190705040950.GO7212@mtr-leonro.mtl.com>

On Fri, Jul 05, 2019 at 07:09:50AM +0300, Leon Romanovsky wrote:
> On Fri, Jul 05, 2019 at 07:49:06AM +0530, Parav Pandit wrote:
> > On Fri, Jun 28, 2019 at 2:20 PM Dag Moxnes <dag.moxnes@oracle.com> wrote:
> > >
> > > Use neighbour lock when copying MAC address from neighbour data struct
> > > in dst_fetch_ha.
> > >
> > > When not using the lock, it is possible for the function to race with
> > > neigh_update, causing it to copy an invalid MAC address.
> > >
> > > It is possible to provoke this error by calling rdma_resolve_addr in a
> > > tight loop, while deleting the corresponding ARP entry in another tight
> > > loop.
> > >
> > > Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
> > > Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> > >
> > > v1 -> v2:
> > >    * Modified implementation to improve readability
> > >  drivers/infiniband/core/addr.c | 9 ++++++---
> > >  1 file changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> > > index 2f7d141598..51323ffbc5 100644
> > > +++ b/drivers/infiniband/core/addr.c
> > > @@ -333,11 +333,14 @@ static int dst_fetch_ha(const struct dst_entry *dst,
> > >         if (!n)
> > >                 return -ENODATA;
> > >
> > > -       if (!(n->nud_state & NUD_VALID)) {
> > > +       read_lock_bh(&n->lock);
> > > +       if (n->nud_state & NUD_VALID) {
> > > +               memcpy(dev_addr->dst_dev_addr, n->ha, MAX_ADDR_LEN);
> > > +               read_unlock_bh(&n->lock);
> > > +       } else {
> > > +               read_unlock_bh(&n->lock);
> > >                 neigh_event_send(n, NULL);
> > >                 ret = -ENODATA;
> > > -       } else {
> > > -               memcpy(dev_addr->dst_dev_addr, n->ha, MAX_ADDR_LEN);
> > >         }
> > >
> > >         neigh_release(n);
> > >
> > Reviewed-by: Parav Pandit <parav@mellanox.com>
> >
> > A sample trace such as below in commit message would be good to have.
> > Or the similar one that you noticed with ARP delete sequence.
> >
> > neigh_changeaddr()
> >   neigh_flush_dev()
> >    n->nud_state = NUD_NOARP;
> >
> > Having some issues with office outlook, so replying via gmail.
> 
> Your replies from gmail looks much better when you used Outlook - proper
> spacing between quoted text and your reply.

Why not use thunderbird or something?

Jason

^ permalink raw reply

* Re: [PATCH rdma-next 2/2] IB: Support netlink commands in non init_net net namespaces
From: Jason Gunthorpe @ 2019-07-09 12:25 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Doug Ledford, RDMA mailing list, Parav Pandit, Steve Wise
In-Reply-To: <20190709063842.GE7034@mtr-leonro.mtl.com>

On Tue, Jul 09, 2019 at 09:38:42AM +0300, Leon Romanovsky wrote:
> On Mon, Jul 08, 2019 at 05:20:23PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 04, 2019 at 04:04:02PM +0300, Leon Romanovsky wrote:
> > > -int rdma_nl_unicast(struct sk_buff *skb, u32 pid)
> > > +int rdma_nl_unicast(struct net *net, struct sk_buff *skb, u32 pid)
> > >  {
> > > +	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
> >
> > This should be a proper type safe accessor in all places
> 
> "const struct net *net" and not "struct net *net"?

No, 

  static inline struct rdma_dev_net *rdma_get_net(struct net *net);

> >
> > > -void rdma_nl_exit(void)
> > > +void rdma_nl_net_exit(struct rdma_dev_net *rnet)
> > >  {
> > > -	int idx;
> > > -
> > > -	for (idx = 0; idx < RDMA_NL_NUM_CLIENTS; idx++)
> > > -		rdma_nl_unregister(idx);
> >
> > There should be a WARN_ON during the module unload that no NL clients
> > are still registered
> 
> IMHO, the usage of WARN_ON is overrated.

If there is to be any code at all I want to see WARN_ON's for things
that can't happen. Not pr_debug, not nonsense loops like the above.

Jason

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jason Gunthorpe @ 2019-07-09 12:06 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Leon Romanovsky, Danil Kipnis, linux-block@vger.kernel.org,
	linux-rdma@vger.kernel.org, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, bvanassche@acm.org, dledford@redhat.com, Roman Pen,
	Greg Kroah-Hartman, Jinpu Wang
In-Reply-To: <CAD9gYJL=fo4Oa2hmU4WZgQrzypRbzoPrrFjNQKP2EZFXYxYNCA@mail.gmail.com>

On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道：
> >
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > >
> > > Could you please provide some feedback to the IBNBD driver and the
> > > IBTRS library?
> > > So far we addressed all the requests provided by the community and
> > > continue to maintain our code up-to-date with the upstream kernel
> > > while having an extra compatibility layer for older kernels in our
> > > out-of-tree repository.
> > > I understand that SRP and NVMEoF which are in the kernel already do
> > > provide equivalent functionality for the majority of the use cases.
> > > IBNBD on the other hand is showing higher performance and more
> > > importantly includes the IBTRS - a general purpose library to
> > > establish connections and transport BIO-like read/write sg-lists over
> > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > it would make sense for us to rework our patchset and try pushing it
> > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > beneficial for the eco-system, find a proper location for it within
> > > block/rdma subsystems? This would make it easier for people to try it
> > > out and would also be a huge step for us in terms of maintenance
> > > effort.
> > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > near future). Do you think it would make sense to rename the driver to
> > > RNBD/RTRS?
> >
> > It is better to avoid "staging" tree, because it will lack attention of
> > relevant people and your efforts will be lost once you will try to move
> > out of staging. We are all remembering Lustre and don't want to see it
> > again.
> >
> > Back then, you was asked to provide support for performance superiority.
> > Can you please share any numbers with us?
> Hi Leon,
> 
> Thanks for you feedback.
> 
> For performance numbers,  Danil did intensive benchmark, and create
> some PDF with graphes here:
> https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> 
> It includes both single path results also different multipath policy results.
> 
> If you have any question regarding the results, please let us know.

I kind of recall that last time the perf numbers were skewed toward
IBNBD because the invalidation model for MR was wrong - did this get
fixed?

Jason

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jason Gunthorpe @ 2019-07-09 12:04 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block@vger.kernel.org,
	linux-rdma@vger.kernel.org, axboe@kernel.dk, Christoph Hellwig,
	Sagi Grimberg, bvanassche@acm.org, dledford@redhat.com, Roman Pen,
	gregkh@linuxfoundation.org
In-Reply-To: <CAHg0HuzUaKs-ACHah-VdNHbot0_usx4ErMesVAw8+DFR63FFqw@mail.gmail.com>

On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> 
> Could you please provide some feedback to the IBNBD driver and the
> IBTRS library?

From my perspective you need to get people from the block community to
go over this.

It is the merge window right now so nobody is really looking at
patches, you may need to resend it after rc1 to get attention.

Jason

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jinpu Wang @ 2019-07-09 11:57 UTC (permalink / raw)
  To: Greg KH
  Cc: Leon Romanovsky, Danil Kipnis, linux-block, linux-rdma,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, bvanassche, jgg,
	dledford, Roman Pen, Jinpu Wang
In-Reply-To: <20190709111737.GB6719@kroah.com>

Greg KH <gregkh@linuxfoundation.org> 于2019年7月9日周二 下午1:17写道：
>
> On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > >
> > > Could you please provide some feedback to the IBNBD driver and the
> > > IBTRS library?
> > > So far we addressed all the requests provided by the community and
> > > continue to maintain our code up-to-date with the upstream kernel
> > > while having an extra compatibility layer for older kernels in our
> > > out-of-tree repository.
> > > I understand that SRP and NVMEoF which are in the kernel already do
> > > provide equivalent functionality for the majority of the use cases.
> > > IBNBD on the other hand is showing higher performance and more
> > > importantly includes the IBTRS - a general purpose library to
> > > establish connections and transport BIO-like read/write sg-lists over
> > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > it would make sense for us to rework our patchset and try pushing it
> > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > beneficial for the eco-system, find a proper location for it within
> > > block/rdma subsystems? This would make it easier for people to try it
> > > out and would also be a huge step for us in terms of maintenance
> > > effort.
> > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > near future). Do you think it would make sense to rename the driver to
> > > RNBD/RTRS?
> >
> > It is better to avoid "staging" tree, because it will lack attention of
> > relevant people and your efforts will be lost once you will try to move
> > out of staging. We are all remembering Lustre and don't want to see it
> > again.
>
> That's up to the developers, that had nothing to do with the fact that
> the code was in the staging tree.  If the Lustre developers had actually
> done the requested work, it would have moved out of the staging tree.
>
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.
Thanks Greg,

This is encouraging, we ARE willing to do the work to get IBNBD/IBTRS merged to
upstream kernel. We regularly contribute to stable kernel also
upsteam, backport patches, testing
stable rc release etc. We believe in opensource and the power of community.

Sure, we will try to go with so called real kernel, this is also what
we are doing
and did in the past, but since v3, we did not receive any real feedback.

We will see how thing will go.

Thanks again!
Jack Wang @ 1 & 1 IONOS Cloud GmbH


>
> But I will note that it is almost always easier to just do the work
> ahead of time, and merge it in "correctly" than to go from staging into
> the real part of the kernel.  But it's up to the developers what they
> want to do.
>
> thanks,
>
> greg k-h

^ permalink raw reply

* [PATCH v4] RDMA/core: Fix race when resolving IP address
From: Dag Moxnes @ 2019-07-09 11:50 UTC (permalink / raw)
  To: dledford, jgg, leon, parav; +Cc: linux-rdma, linux-kernel

Use the neighbour lock when copying the MAC address from the neighbour
data struct in dst_fetch_ha.

When not using the lock, it is possible for the function to race with
neigh_update(), causing it to copy an torn MAC address:

rdma_resolve_addr()
  rdma_resolve_ip()
    addr_resolve()
      addr_resolve_neigh()
        fetch_ha()
          dst_fetch_ha()
	     memcpy(dev_addr->dst_dev_addr, n->ha, MAX_ADDR_LEN)

and

net_ioctl()
  arp_ioctl()
    arp_rec_delete()
      arp_invalidate()
        neigh_update()
          __neigh_update()
	    memcpy(&neigh->ha, lladdr, dev->addr_len)

It is possible to provoke this error by calling rdma_resolve_addr() in a
tight loop, while deleting the corresponding ARP entry in another tight
loop.

Fixes: 51d45974515c ("infiniband: addr: Consolidate code to fetch neighbour hardware address from dst.")
Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/core/addr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 2f7d141598..9b76a8fcdd 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -337,7 +337,7 @@ static int dst_fetch_ha(const struct dst_entry *dst,
 		neigh_event_send(n, NULL);
 		ret = -ENODATA;
 	} else {
-		memcpy(dev_addr->dst_dev_addr, n->ha, MAX_ADDR_LEN);
+		neigh_ha_snapshot(dev_addr->dst_dev_addr, n, dst->dev);
 	}
 
 	neigh_release(n);
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jinpu Wang @ 2019-07-09 11:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Danil Kipnis, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, bvanassche, jgg, dledford,
	Roman Pen, Greg Kroah-Hartman, Jinpu Wang
In-Reply-To: <20190709110036.GQ7034@mtr-leonro.mtl.com>

Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道：
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > it would make sense for us to rework our patchset and try pushing it
> > for staging tree first, so that we can proof IBNBD is well maintained,
> > beneficial for the eco-system, find a proper location for it within
> > block/rdma subsystems? This would make it easier for people to try it
> > out and would also be a huge step for us in terms of maintenance
> > effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
>
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.
>
> Back then, you was asked to provide support for performance superiority.
> Can you please share any numbers with us?
Hi Leon,

Thanks for you feedback.

For performance numbers,  Danil did intensive benchmark, and create
some PDF with graphes here:
https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3

It includes both single path results also different multipath policy results.

If you have any question regarding the results, please let us know.

>
> Thanks

Thanks
Jack Wang

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Greg KH @ 2019-07-09 11:17 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, Sagi Grimberg, bvanassche, jgg, dledford,
	Roman Pen
In-Reply-To: <20190709110036.GQ7034@mtr-leonro.mtl.com>

On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > it would make sense for us to rework our patchset and try pushing it
> > for staging tree first, so that we can proof IBNBD is well maintained,
> > beneficial for the eco-system, find a proper location for it within
> > block/rdma subsystems? This would make it easier for people to try it
> > out and would also be a huge step for us in terms of maintenance
> > effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
> 
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.

That's up to the developers, that had nothing to do with the fact that
the code was in the staging tree.  If the Lustre developers had actually
done the requested work, it would have moved out of the staging tree.

So if these developers are willing to do the work to get something out
of staging, and into the "real" part of the kernel, I will gladly take
it.

But I will note that it is almost always easier to just do the work
ahead of time, and merge it in "correctly" than to go from staging into
the real part of the kernel.  But it's up to the developers what they
want to do.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Leon Romanovsky @ 2019-07-09 11:00 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	Sagi Grimberg, bvanassche, jgg, dledford, Roman Pen, gregkh
In-Reply-To: <CAHg0HuzUaKs-ACHah-VdNHbot0_usx4ErMesVAw8+DFR63FFqw@mail.gmail.com>

On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
>
> Could you please provide some feedback to the IBNBD driver and the
> IBTRS library?
> So far we addressed all the requests provided by the community and
> continue to maintain our code up-to-date with the upstream kernel
> while having an extra compatibility layer for older kernels in our
> out-of-tree repository.
> I understand that SRP and NVMEoF which are in the kernel already do
> provide equivalent functionality for the majority of the use cases.
> IBNBD on the other hand is showing higher performance and more
> importantly includes the IBTRS - a general purpose library to
> establish connections and transport BIO-like read/write sg-lists over
> RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> I believe IBNBD does meet the kernel coding standards, it doesn't have
> a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> it would make sense for us to rework our patchset and try pushing it
> for staging tree first, so that we can proof IBNBD is well maintained,
> beneficial for the eco-system, find a proper location for it within
> block/rdma subsystems? This would make it easier for people to try it
> out and would also be a huge step for us in terms of maintenance
> effort.
> The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> near future). Do you think it would make sense to rename the driver to
> RNBD/RTRS?

It is better to avoid "staging" tree, because it will lack attention of
relevant people and your efforts will be lost once you will try to move
out of staging. We are all remembering Lustre and don't want to see it
again.

Back then, you was asked to provide support for performance superiority.
Can you please share any numbers with us?

Thanks

^ permalink raw reply

* Re: [PATCH 1/2][next] RDMA/hns: fix comparison of unsigned long variable 'end' with less than zero
From: oulijun @ 2019-07-09 10:58 UTC (permalink / raw)
  To: Colin King, Wei Hu, Doug Ledford, Jason Gunthorpe, linux-rdma
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20190531092101.28772-1-colin.king@canonical.com>

在 2019/5/31 17:21, Colin King 写道:
> From: Colin Ian King <colin.king@canonical.com>
>
> Currently the comparison of end with less than zero is always false
> because end is an unsigned long.  Also, replace checks of end with
> non-zero with end > 0 as it is possible that the #defined decrement
> may be changed in the future causing end to step over zero and go
> negative.
>
> The initialization of end with 0 is also redundant as this value is
> never read and is later set to HW_SYNC_TIMEOUT_MSECS, so fix this by
> initializing it with this value to begin with.
>
> Addresses-Coverity: ("Unsigned compared against 0")
> Fixes: 669cefb654cb ("RDMA/hns: Remove jiffies operation in disable interrupt context")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/infiniband/hw/hns/hns_roce_hem.c   |  4 ++--
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 12 ++++++------
>  2 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
> index 157c84a1f55f..b3641aeff27a 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
> @@ -326,7 +326,7 @@ static int hns_roce_set_hem(struct hns_roce_dev *hr_dev,
>  {
>  	spinlock_t *lock = &hr_dev->bt_cmd_lock;
>  	struct device *dev = hr_dev->dev;
> -	unsigned long end = 0;
> +	long end;
>  	unsigned long flags;
>  	struct hns_roce_hem_iter iter;
>  	void __iomem *bt_cmd;
> @@ -377,7 +377,7 @@ static int hns_roce_set_hem(struct hns_roce_dev *hr_dev,
>  		bt_cmd = hr_dev->reg_base + ROCEE_BT_CMD_H_REG;
>  
>  		end = HW_SYNC_TIMEOUT_MSECS;
> -		while (end) {
> +		while (end > 0) {
>  			if (!readl(bt_cmd) >> BT_CMD_SYNC_SHIFT)
>  				break;
>  
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
> index 1fc77b1f2c6d..e13fea71bcb8 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
> @@ -966,7 +966,7 @@ static int hns_roce_v1_recreate_lp_qp(struct hns_roce_dev *hr_dev)
>  	struct hns_roce_free_mr *free_mr;
>  	struct hns_roce_v1_priv *priv;
>  	struct completion comp;
> -	unsigned long end = HNS_ROCE_V1_RECREATE_LP_QP_TIMEOUT_MSECS;
> +	long end = HNS_ROCE_V1_RECREATE_LP_QP_TIMEOUT_MSECS;
>  
>  	priv = (struct hns_roce_v1_priv *)hr_dev->priv;
>  	free_mr = &priv->free_mr;
> @@ -986,7 +986,7 @@ static int hns_roce_v1_recreate_lp_qp(struct hns_roce_dev *hr_dev)
>  
>  	queue_work(free_mr->free_mr_wq, &(lp_qp_work->work));
>  
> -	while (end) {
> +	while (end > 0) {
>  		if (try_wait_for_completion(&comp))
>  			return 0;
>  		msleep(HNS_ROCE_V1_RECREATE_LP_QP_WAIT_VALUE);
> @@ -1104,7 +1104,7 @@ static int hns_roce_v1_dereg_mr(struct hns_roce_dev *hr_dev,
>  	struct hns_roce_free_mr *free_mr;
>  	struct hns_roce_v1_priv *priv;
>  	struct completion comp;
> -	unsigned long end = HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS;
> +	long end = HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS;
>  	unsigned long start = jiffies;
>  	int npages;
>  	int ret = 0;
> @@ -1134,7 +1134,7 @@ static int hns_roce_v1_dereg_mr(struct hns_roce_dev *hr_dev,
>  
>  	queue_work(free_mr->free_mr_wq, &(mr_work->work));
>  
> -	while (end) {
> +	while (end > 0) {
>  		if (try_wait_for_completion(&comp))
>  			goto free_mr;
>  		msleep(HNS_ROCE_V1_FREE_MR_WAIT_VALUE);
> @@ -2425,7 +2425,8 @@ static int hns_roce_v1_clear_hem(struct hns_roce_dev *hr_dev,
>  {
>  	struct device *dev = &hr_dev->pdev->dev;
>  	struct hns_roce_v1_priv *priv;
> -	unsigned long end = 0, flags = 0;
> +	unsigned long flags = 0;
> +	long end = HW_SYNC_TIMEOUT_MSECS;
>  	__le32 bt_cmd_val[2] = {0};
>  	void __iomem *bt_cmd;
>  	u64 bt_ba = 0;
> @@ -2463,7 +2464,6 @@ static int hns_roce_v1_clear_hem(struct hns_roce_dev *hr_dev,
>  
>  	bt_cmd = hr_dev->reg_base + ROCEE_BT_CMD_H_REG;
>  
> -	end = HW_SYNC_TIMEOUT_MSECS;
>  	while (1) {
>  		if (readl(bt_cmd) >> BT_CMD_SYNC_SHIFT) {
>  			if (end < 0) {

Good, thanks.



^ permalink raw reply

* Re: [PATCH v5 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Leon Romanovsky @ 2019-07-09 10:52 UTC (permalink / raw)
  To: Michal Kalderon
  Cc: Ariel Elior, jgg@ziepe.ca, dledford@redhat.com,
	galpress@amazon.com, linux-rdma@vger.kernel.org,
	davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <MN2PR18MB31825FB2CCC56C47763C9D0DA1F10@MN2PR18MB3182.namprd18.prod.outlook.com>

On Tue, Jul 09, 2019 at 10:29:36AM +0000, Michal Kalderon wrote:
> > From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> > owner@vger.kernel.org> On Behalf Of Leon Romanovsky
> >
> > On Mon, Jul 08, 2019 at 12:14:58PM +0300, Michal Kalderon wrote:
> > > Create some common API's for adding entries to a xa_mmap.
> > > Searching for an entry and freeing one.
> > >
> > > The code was copied from the efa driver almost as is, just renamed
> > > function to be generic and not efa specific.
> > >
> > > Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
> > > ---
> > >  drivers/infiniband/core/device.c      |   1 +
> > >  drivers/infiniband/core/rdma_core.c   |   1 +
> > >  drivers/infiniband/core/uverbs_cmd.c  |   1 +
> > >  drivers/infiniband/core/uverbs_main.c | 105
> > ++++++++++++++++++++++++++++++++++
> > >  include/rdma/ib_verbs.h               |  32 +++++++++++
> > >  5 files changed, 140 insertions(+)
> > >
> > > diff --git a/drivers/infiniband/core/device.c
> > > b/drivers/infiniband/core/device.c
> > > index 8a6ccb936dfe..a830c2c5d691 100644
> > > --- a/drivers/infiniband/core/device.c
> > > +++ b/drivers/infiniband/core/device.c
> > > @@ -2521,6 +2521,7 @@ void ib_set_device_ops(struct ib_device *dev,
> > const struct ib_device_ops *ops)
> > >  	SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
> > >  	SET_DEVICE_OP(dev_ops, map_phys_fmr);
> > >  	SET_DEVICE_OP(dev_ops, mmap);
> > > +	SET_DEVICE_OP(dev_ops, mmap_free);
> > >  	SET_DEVICE_OP(dev_ops, modify_ah);
> > >  	SET_DEVICE_OP(dev_ops, modify_cq);
> > >  	SET_DEVICE_OP(dev_ops, modify_device); diff --git
> > > a/drivers/infiniband/core/rdma_core.c
> > > b/drivers/infiniband/core/rdma_core.c
> > > index ccf4d069c25c..7166741834c8 100644
> > > --- a/drivers/infiniband/core/rdma_core.c
> > > +++ b/drivers/infiniband/core/rdma_core.c
> > > @@ -817,6 +817,7 @@ static void ufile_destroy_ucontext(struct
> > ib_uverbs_file *ufile,
> > >  	rdma_restrack_del(&ucontext->res);
> > >
> > >  	ib_dev->ops.dealloc_ucontext(ucontext);
> > > +	rdma_user_mmap_entries_remove_free(ucontext);
> > >  	kfree(ucontext);
> > >
> > >  	ufile->ucontext = NULL;
> > > diff --git a/drivers/infiniband/core/uverbs_cmd.c
> > > b/drivers/infiniband/core/uverbs_cmd.c
> > > index 7ddd0e5bc6b3..44c0600245e4 100644
> > > --- a/drivers/infiniband/core/uverbs_cmd.c
> > > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > > @@ -254,6 +254,7 @@ static int ib_uverbs_get_context(struct
> > > uverbs_attr_bundle *attrs)
> > >
> > >  	mutex_init(&ucontext->per_mm_list_lock);
> > >  	INIT_LIST_HEAD(&ucontext->per_mm_list);
> > > +	xa_init(&ucontext->mmap_xa);
> > >
> > >  	ret = get_unused_fd_flags(O_CLOEXEC);
> > >  	if (ret < 0)
> > > diff --git a/drivers/infiniband/core/uverbs_main.c
> > > b/drivers/infiniband/core/uverbs_main.c
> > > index 11c13c1381cf..37507cc27e8c 100644
> > > --- a/drivers/infiniband/core/uverbs_main.c
> > > +++ b/drivers/infiniband/core/uverbs_main.c
> > > @@ -965,6 +965,111 @@ int rdma_user_mmap_io(struct ib_ucontext
> > > *ucontext, struct vm_area_struct *vma,  }
> > > EXPORT_SYMBOL(rdma_user_mmap_io);
> > >
> > > +static inline u64
> > > +rdma_user_mmap_get_key(const struct rdma_user_mmap_entry
> > *entry) {
> > > +	return (u64)entry->mmap_page << PAGE_SHIFT; }
> > > +
> > > +struct rdma_user_mmap_entry *
> > > +rdma_user_mmap_entry_get(struct ib_ucontext *ucontext, u64 key, u64
> > > +len) {
> > > +	struct rdma_user_mmap_entry *entry;
> > > +	u64 mmap_page;
> > > +
> > > +	mmap_page = key >> PAGE_SHIFT;
> > > +	if (mmap_page > U32_MAX)
> > > +		return NULL;
> > > +
> > > +	entry = xa_load(&ucontext->mmap_xa, mmap_page);
> > > +	if (!entry || rdma_user_mmap_get_key(entry) != key ||
> > > +	    entry->length != len)
> > > +		return NULL;
> > > +
> > > +	ibdev_dbg(ucontext->device,
> > > +		  "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx]
> > removed\n",
> > > +		  entry->obj, key, entry->address, entry->length);
> > > +
> > > +	return entry;
> > > +}
> > > +EXPORT_SYMBOL(rdma_user_mmap_entry_get);
> >
> > Please add function description in kernel doc format for all newly
> > EXPORT_SYMBOL() functions you introduced in RDMA/core.
> Ok. Could you give me a reference to an example? Where should the
> Documentation be added to?

Above function in *.c file.

For example, see function rdma_set_ack_timeout():
https://patchwork.kernel.org/patch/10778827/

Thanks

^ permalink raw reply

* RE: [PATCH v5 rdma-next 2/6] RDMA/efa: Use the common mmap_xa helpers
From: Michal Kalderon @ 2019-07-09 10:30 UTC (permalink / raw)
  To: Gal Pressman, Ariel Elior, jgg@ziepe.ca, dledford@redhat.com
  Cc: linux-rdma@vger.kernel.org, davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <0a28f174-1875-452e-ea0a-c8db2d243ce5@amazon.com>

> From: Gal Pressman <galpress@amazon.com>
> Sent: Tuesday, July 9, 2019 12:03 PM
> 
> On 08/07/2019 12:14, Michal Kalderon wrote:
> 
> Hi, a few nits:
Thanks for the review, will fix them.

> 
> > Remove the functions related to managing the mmap_xa database.
> > This code was copied to the ib_core. Use the common API's instead.
> >
> > Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
> > ---
> >  drivers/infiniband/hw/efa/efa.h       |   3 +-
> >  drivers/infiniband/hw/efa/efa_main.c  |   1 +
> >  drivers/infiniband/hw/efa/efa_verbs.c | 183
> > ++++++++--------------------------
> >  3 files changed, 42 insertions(+), 145 deletions(-) diff --git
> > a/drivers/infiniband/hw/efa/efa_verbs.c
> > b/drivers/infiniband/hw/efa/efa_verbs.c
> > index df77bc312a25..5dff892da161 100644
> > --- a/drivers/infiniband/hw/efa/efa_verbs.c
> > +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> > @@ -13,34 +13,15 @@
> >
> >  #include "efa.h"
> >
> > -#define EFA_MMAP_FLAG_SHIFT 56
> > -#define EFA_MMAP_PAGE_MASK GENMASK(EFA_MMAP_FLAG_SHIFT -
> 1, 0)
> > -#define EFA_MMAP_INVALID U64_MAX
> > -
> 
> Don't delete the blank line please.
> 
> >  enum {
> >  	EFA_MMAP_DMA_PAGE = 0,
> >  	EFA_MMAP_IO_WC,
> >  	EFA_MMAP_IO_NC,
> >  };
> > -
> >  #define EFA_AENQ_ENABLED_GROUPS \
> >  	(BIT(EFA_ADMIN_FATAL_ERROR) | BIT(EFA_ADMIN_WARNING) | \
> >  	 BIT(EFA_ADMIN_NOTIFICATION) | BIT(EFA_ADMIN_KEEP_ALIVE))
> >
> > -struct efa_mmap_entry {
> > -	void  *obj;
> > -	u64 address;
> > -	u64 length;
> > -	u32 mmap_page;
> > -	u8 mmap_flag;
> > -};
> > -
> > -static inline u64 get_mmap_key(const struct efa_mmap_entry *efa) -{
> > -	return ((u64)efa->mmap_flag << EFA_MMAP_FLAG_SHIFT) |
> > -	       ((u64)efa->mmap_page << PAGE_SHIFT);
> > -}
> > -
> >  #define EFA_CHUNK_PAYLOAD_SHIFT       12
> >  #define EFA_CHUNK_PAYLOAD_SIZE
> BIT(EFA_CHUNK_PAYLOAD_SHIFT)
> >  #define EFA_CHUNK_PAYLOAD_PTR_SIZE    8
> > @@ -145,105 +126,7 @@ static void *efa_zalloc_mapped(struct efa_dev
> *dev, dma_addr_t *dma_addr,
> >  	return addr;
> >  }
> >
> > -/*
> > - * This is only called when the ucontext is destroyed and there can
> > be no
> > - * concurrent query via mmap or allocate on the xarray, thus we can
> > be sure no
> > - * other thread is using the entry pointer. We also know that all the
> > BAR
> > - * pages have either been zap'd or munmaped at this point.  Normal
> > pages are
> > - * refcounted and will be freed at the proper time.
> > - */
> > -static void mmap_entries_remove_free(struct efa_dev *dev,
> > -				     struct efa_ucontext *ucontext)
> > -{
> > -	struct efa_mmap_entry *entry;
> > -	unsigned long mmap_page;
> >
> > -	xa_for_each(&ucontext->mmap_xa, mmap_page, entry) {
> > -		xa_erase(&ucontext->mmap_xa, mmap_page);
> > -
> > -		ibdev_dbg(
> > -			&dev->ibdev,
> > -			"mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx]
> removed\n",
> > -			entry->obj, get_mmap_key(entry), entry->address,
> > -			entry->length);
> > -		if (entry->mmap_flag == EFA_MMAP_DMA_PAGE)
> > -			/* DMA mapping is already gone, now free the pages
> */
> > -			free_pages_exact(phys_to_virt(entry->address),
> > -					 entry->length);
> > -		kfree(entry);
> > -	}
> > -}
> > -
> > -static struct efa_mmap_entry *mmap_entry_get(struct efa_dev *dev,
> > -					     struct efa_ucontext *ucontext,
> > -					     u64 key, u64 len)
> > -{
> > -	struct efa_mmap_entry *entry;
> > -	u64 mmap_page;
> > -
> > -	mmap_page = (key & EFA_MMAP_PAGE_MASK) >> PAGE_SHIFT;
> > -	if (mmap_page > U32_MAX)
> > -		return NULL;
> > -
> > -	entry = xa_load(&ucontext->mmap_xa, mmap_page);
> > -	if (!entry || get_mmap_key(entry) != key || entry->length != len)
> > -		return NULL;
> > -
> > -	ibdev_dbg(&dev->ibdev,
> > -		  "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx]
> removed\n",
> > -		  entry->obj, key, entry->address, entry->length);
> > -
> > -	return entry;
> > -}
> > -
> > -/*
> > - * Note this locking scheme cannot support removal of entries, except
> > during
> > - * ucontext destruction when the core code guarentees no concurrency.
> > - */
> > -static u64 mmap_entry_insert(struct efa_dev *dev, struct efa_ucontext
> *ucontext,
> > -			     void *obj, u64 address, u64 length, u8 mmap_flag)
> > -{
> > -	struct efa_mmap_entry *entry;
> > -	u32 next_mmap_page;
> > -	int err;
> > -
> > -	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> > -	if (!entry)
> > -		return EFA_MMAP_INVALID;
> > -
> > -	entry->obj = obj;
> > -	entry->address = address;
> > -	entry->length = length;
> > -	entry->mmap_flag = mmap_flag;
> > -
> > -	xa_lock(&ucontext->mmap_xa);
> > -	if (check_add_overflow(ucontext->mmap_xa_page,
> > -			       (u32)(length >> PAGE_SHIFT),
> > -			       &next_mmap_page))
> > -		goto err_unlock;
> > -
> > -	entry->mmap_page = ucontext->mmap_xa_page;
> > -	ucontext->mmap_xa_page = next_mmap_page;
> > -	err = __xa_insert(&ucontext->mmap_xa, entry->mmap_page,
> entry,
> > -			  GFP_KERNEL);
> > -	if (err)
> > -		goto err_unlock;
> > -
> > -	xa_unlock(&ucontext->mmap_xa);
> > -
> > -	ibdev_dbg(
> > -		&dev->ibdev,
> > -		"mmap: obj[0x%p] addr[%#llx], len[%#llx], key[%#llx]
> inserted\n",
> > -		entry->obj, entry->address, entry->length,
> get_mmap_key(entry));
> > -
> > -	return get_mmap_key(entry);
> > -
> > -err_unlock:
> > -	xa_unlock(&ucontext->mmap_xa);
> > -	kfree(entry);
> > -	return EFA_MMAP_INVALID;
> > -
> > -}
> >
> 
> You left two extra blank lines between efa_zalloc_mapped and
> efa_query_device.
> 
> >  int efa_query_device(struct ib_device *ibdev,
> >  		     struct ib_device_attr *props,
> > @@ -488,45 +371,52 @@ static int qp_mmap_entries_setup(struct efa_qp
> *qp,
> >  				 struct efa_com_create_qp_params
> *params,
> >  				 struct efa_ibv_create_qp_resp *resp)  {
> > +	u64 address;
> > +	u64 length;
> 
> Line break.
> 
> >  	/*
> >  	 * Once an entry is inserted it might be mmapped, hence cannot be
> >  	 * cleaned up until dealloc_ucontext.
> >  	 */
> >  	resp->sq_db_mmap_key =
> > -		mmap_entry_insert(dev, ucontext, qp,
> > -				  dev->db_bar_addr + resp->sq_db_offset,
> > -				  PAGE_SIZE, EFA_MMAP_IO_NC);
> > -	if (resp->sq_db_mmap_key == EFA_MMAP_INVALID)
> > +		rdma_user_mmap_entry_insert(&ucontext->ibucontext, qp,
> > +					    dev->db_bar_addr +
> > +					    resp->sq_db_offset,
> > +					    PAGE_SIZE, EFA_MMAP_IO_NC);
> > +	if (resp->sq_db_mmap_key == RDMA_USER_MMAP_INVALID)
> >  		return -ENOMEM;
> >
> >  	resp->sq_db_offset &= ~PAGE_MASK;
> >
> > +	address = dev->mem_bar_addr + resp->llq_desc_offset;
> > +	length = PAGE_ALIGN(params->sq_ring_size_in_bytes +
> > +			    (resp->llq_desc_offset & ~PAGE_MASK));
> >  	resp->llq_desc_mmap_key =
> > -		mmap_entry_insert(dev, ucontext, qp,
> > -				  dev->mem_bar_addr + resp-
> >llq_desc_offset,
> > -				  PAGE_ALIGN(params-
> >sq_ring_size_in_bytes +
> > -					     (resp->llq_desc_offset &
> ~PAGE_MASK)),
> > -				  EFA_MMAP_IO_WC);
> > -	if (resp->llq_desc_mmap_key == EFA_MMAP_INVALID)
> > +		rdma_user_mmap_entry_insert(&ucontext->ibucontext, qp,
> > +					    address,
> > +					    length,
> > +					    EFA_MMAP_IO_WC);
> > +	if (resp->llq_desc_mmap_key == RDMA_USER_MMAP_INVALID)
> >  		return -ENOMEM;
> >
> >  	resp->llq_desc_offset &= ~PAGE_MASK;
> >
> >  	if (qp->rq_size) {
> > +		address = dev->db_bar_addr + resp->rq_db_offset;
> >  		resp->rq_db_mmap_key =
> > -			mmap_entry_insert(dev, ucontext, qp,
> > -					  dev->db_bar_addr + resp-
> >rq_db_offset,
> > -					  PAGE_SIZE, EFA_MMAP_IO_NC);
> > -		if (resp->rq_db_mmap_key == EFA_MMAP_INVALID)
> > +			rdma_user_mmap_entry_insert(&ucontext-
> >ibucontext, qp,
> > +						    address, PAGE_SIZE,
> > +						    EFA_MMAP_IO_NC);
> > +		if (resp->rq_db_mmap_key ==
> RDMA_USER_MMAP_INVALID)
> >  			return -ENOMEM;
> >
> >  		resp->rq_db_offset &= ~PAGE_MASK;
> >
> > +		address = virt_to_phys(qp->rq_cpu_addr);
> >  		resp->rq_mmap_key =
> > -			mmap_entry_insert(dev, ucontext, qp,
> > -					  virt_to_phys(qp->rq_cpu_addr),
> > -					  qp->rq_size,
> EFA_MMAP_DMA_PAGE);
> > -		if (resp->rq_mmap_key == EFA_MMAP_INVALID)
> > +			rdma_user_mmap_entry_insert(&ucontext-
> >ibucontext, qp,
> > +						    address, qp->rq_size,
> > +						    EFA_MMAP_DMA_PAGE);
> > +		if (resp->rq_mmap_key == RDMA_USER_MMAP_INVALID)
> >  			return -ENOMEM;
> >
> >  		resp->rq_mmap_size = qp->rq_size;
> > @@ -875,11 +765,13 @@ void efa_destroy_cq(struct ib_cq *ibcq, struct
> > ib_udata *udata)  static int cq_mmap_entries_setup(struct efa_dev *dev,
> struct efa_cq *cq,
> >  				 struct efa_ibv_create_cq_resp *resp)  {
> > +	struct efa_ucontext *ucontext = cq->ucontext;
> 
> Line break.
> 
> >  	resp->q_mmap_size = cq->size;
> > -	resp->q_mmap_key = mmap_entry_insert(dev, cq->ucontext, cq,
> > -					     virt_to_phys(cq->cpu_addr),
> > -					     cq->size,
> EFA_MMAP_DMA_PAGE);
> > -	if (resp->q_mmap_key == EFA_MMAP_INVALID)
> > +	resp->q_mmap_key =
> > +		rdma_user_mmap_entry_insert(&ucontext->ibucontext, cq,
> > +					    virt_to_phys(cq->cpu_addr),
> > +					    cq->size,
> EFA_MMAP_DMA_PAGE);
> > +	if (resp->q_mmap_key == RDMA_USER_MMAP_INVALID)
> >  		return -ENOMEM;
> >
> >  	return 0;
> > @@ -1531,7 +1423,6 @@ int efa_alloc_ucontext(struct ib_ucontext
> *ibucontext, struct ib_udata *udata)
> >  		goto err_out;
> >
> >  	ucontext->uarn = result.uarn;
> > -	xa_init(&ucontext->mmap_xa);
> >
> >  	resp.cmds_supp_udata_mask |=
> EFA_USER_CMDS_SUPP_UDATA_QUERY_DEVICE;
> >  	resp.cmds_supp_udata_mask |=
> EFA_USER_CMDS_SUPP_UDATA_CREATE_AH;
> > @@ -1560,19 +1451,25 @@ void efa_dealloc_ucontext(struct ib_ucontext
> *ibucontext)
> >  	struct efa_ucontext *ucontext = to_eucontext(ibucontext);
> >  	struct efa_dev *dev = to_edev(ibucontext->device);
> >
> > -	mmap_entries_remove_free(dev, ucontext);
> >  	efa_dealloc_uar(dev, ucontext->uarn);
> >  }
> >
> > +void efa_mmap_free(u64 address, u64 length, u8 mmap_flag)
> > +{
> > +	/* DMA mapping is already gone, now free the pages */
> > +	if (mmap_flag == EFA_MMAP_DMA_PAGE)
> > +		free_pages_exact(phys_to_virt(address), length);
> > +}
> > +
> >  static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext
> *ucontext,
> >  		      struct vm_area_struct *vma, u64 key, u64 length)
> >  {
> > -	struct efa_mmap_entry *entry;
> > +	struct rdma_user_mmap_entry *entry;
> >  	unsigned long va;
> >  	u64 pfn;
> >  	int err;
> >
> > -	entry = mmap_entry_get(dev, ucontext, key, length);
> > +	entry = rdma_user_mmap_entry_get(&ucontext->ibucontext, key,
> length);
> >  	if (!entry) {
> >  		ibdev_dbg(&dev->ibdev, "key[%#llx] does not have valid
> entry\n",
> >  			  key);
> >

^ permalink raw reply

* RE: [PATCH v5 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Michal Kalderon @ 2019-07-09 10:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Ariel Elior, jgg@ziepe.ca, dledford@redhat.com,
	galpress@amazon.com, linux-rdma@vger.kernel.org,
	davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <20190709070244.GH7034@mtr-leonro.mtl.com>

> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> owner@vger.kernel.org> On Behalf Of Leon Romanovsky
> 
> On Mon, Jul 08, 2019 at 12:14:58PM +0300, Michal Kalderon wrote:
> > Create some common API's for adding entries to a xa_mmap.
> > Searching for an entry and freeing one.
> >
> > The code was copied from the efa driver almost as is, just renamed
> > function to be generic and not efa specific.
> >
> > Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
> > ---
> >  drivers/infiniband/core/device.c      |   1 +
> >  drivers/infiniband/core/rdma_core.c   |   1 +
> >  drivers/infiniband/core/uverbs_cmd.c  |   1 +
> >  drivers/infiniband/core/uverbs_main.c | 105
> ++++++++++++++++++++++++++++++++++
> >  include/rdma/ib_verbs.h               |  32 +++++++++++
> >  5 files changed, 140 insertions(+)
> >
> > diff --git a/drivers/infiniband/core/device.c
> > b/drivers/infiniband/core/device.c
> > index 8a6ccb936dfe..a830c2c5d691 100644
> > --- a/drivers/infiniband/core/device.c
> > +++ b/drivers/infiniband/core/device.c
> > @@ -2521,6 +2521,7 @@ void ib_set_device_ops(struct ib_device *dev,
> const struct ib_device_ops *ops)
> >  	SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
> >  	SET_DEVICE_OP(dev_ops, map_phys_fmr);
> >  	SET_DEVICE_OP(dev_ops, mmap);
> > +	SET_DEVICE_OP(dev_ops, mmap_free);
> >  	SET_DEVICE_OP(dev_ops, modify_ah);
> >  	SET_DEVICE_OP(dev_ops, modify_cq);
> >  	SET_DEVICE_OP(dev_ops, modify_device); diff --git
> > a/drivers/infiniband/core/rdma_core.c
> > b/drivers/infiniband/core/rdma_core.c
> > index ccf4d069c25c..7166741834c8 100644
> > --- a/drivers/infiniband/core/rdma_core.c
> > +++ b/drivers/infiniband/core/rdma_core.c
> > @@ -817,6 +817,7 @@ static void ufile_destroy_ucontext(struct
> ib_uverbs_file *ufile,
> >  	rdma_restrack_del(&ucontext->res);
> >
> >  	ib_dev->ops.dealloc_ucontext(ucontext);
> > +	rdma_user_mmap_entries_remove_free(ucontext);
> >  	kfree(ucontext);
> >
> >  	ufile->ucontext = NULL;
> > diff --git a/drivers/infiniband/core/uverbs_cmd.c
> > b/drivers/infiniband/core/uverbs_cmd.c
> > index 7ddd0e5bc6b3..44c0600245e4 100644
> > --- a/drivers/infiniband/core/uverbs_cmd.c
> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > @@ -254,6 +254,7 @@ static int ib_uverbs_get_context(struct
> > uverbs_attr_bundle *attrs)
> >
> >  	mutex_init(&ucontext->per_mm_list_lock);
> >  	INIT_LIST_HEAD(&ucontext->per_mm_list);
> > +	xa_init(&ucontext->mmap_xa);
> >
> >  	ret = get_unused_fd_flags(O_CLOEXEC);
> >  	if (ret < 0)
> > diff --git a/drivers/infiniband/core/uverbs_main.c
> > b/drivers/infiniband/core/uverbs_main.c
> > index 11c13c1381cf..37507cc27e8c 100644
> > --- a/drivers/infiniband/core/uverbs_main.c
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -965,6 +965,111 @@ int rdma_user_mmap_io(struct ib_ucontext
> > *ucontext, struct vm_area_struct *vma,  }
> > EXPORT_SYMBOL(rdma_user_mmap_io);
> >
> > +static inline u64
> > +rdma_user_mmap_get_key(const struct rdma_user_mmap_entry
> *entry) {
> > +	return (u64)entry->mmap_page << PAGE_SHIFT; }
> > +
> > +struct rdma_user_mmap_entry *
> > +rdma_user_mmap_entry_get(struct ib_ucontext *ucontext, u64 key, u64
> > +len) {
> > +	struct rdma_user_mmap_entry *entry;
> > +	u64 mmap_page;
> > +
> > +	mmap_page = key >> PAGE_SHIFT;
> > +	if (mmap_page > U32_MAX)
> > +		return NULL;
> > +
> > +	entry = xa_load(&ucontext->mmap_xa, mmap_page);
> > +	if (!entry || rdma_user_mmap_get_key(entry) != key ||
> > +	    entry->length != len)
> > +		return NULL;
> > +
> > +	ibdev_dbg(ucontext->device,
> > +		  "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx]
> removed\n",
> > +		  entry->obj, key, entry->address, entry->length);
> > +
> > +	return entry;
> > +}
> > +EXPORT_SYMBOL(rdma_user_mmap_entry_get);
> 
> Please add function description in kernel doc format for all newly
> EXPORT_SYMBOL() functions you introduced in RDMA/core.
Ok. Could you give me a reference to an example? Where should the 
Documentation be added to? 


> 
> > +
> > +/*
> > + * Note this locking scheme cannot support removal of entries, except
> > +during
> > + * ucontext destruction when the core code guarentees no concurrency.
> > + */
> > +u64 rdma_user_mmap_entry_insert(struct ib_ucontext *ucontext, void
> *obj,
> > +				u64 address, u64 length, u8 mmap_flag) {
> > +	struct rdma_user_mmap_entry *entry;
> > +	u32 next_mmap_page;
> > +	int err;
> > +
> > +	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> 
> It is worth to use kzalloc and not kmalloc.
> 
> Thanks

^ permalink raw reply

* RE: [PATCH v5 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Michal Kalderon @ 2019-07-09 10:26 UTC (permalink / raw)
  To: Gal Pressman, Ariel Elior, jgg@ziepe.ca, dledford@redhat.com
  Cc: linux-rdma@vger.kernel.org, davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <da67a821-1b26-c795-ff43-af17324f07e5@amazon.com>

> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> owner@vger.kernel.org> On Behalf Of Gal Pressman
> 
> On 08/07/2019 12:14, Michal Kalderon wrote:
> > diff --git a/drivers/infiniband/core/device.c
> > b/drivers/infiniband/core/device.c
> > index 8a6ccb936dfe..a830c2c5d691 100644
> > --- a/drivers/infiniband/core/device.c
> > +++ b/drivers/infiniband/core/device.c
> > @@ -2521,6 +2521,7 @@ void ib_set_device_ops(struct ib_device *dev,
> const struct ib_device_ops *ops)
> >  	SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
> >  	SET_DEVICE_OP(dev_ops, map_phys_fmr);
> >  	SET_DEVICE_OP(dev_ops, mmap);
> > +	SET_DEVICE_OP(dev_ops, mmap_free);
> >  	SET_DEVICE_OP(dev_ops, modify_ah);
> >  	SET_DEVICE_OP(dev_ops, modify_cq);
> >  	SET_DEVICE_OP(dev_ops, modify_device); diff --git
> > a/drivers/infiniband/core/rdma_core.c
> > b/drivers/infiniband/core/rdma_core.c
> > index ccf4d069c25c..7166741834c8 100644
> > --- a/drivers/infiniband/core/rdma_core.c
> > +++ b/drivers/infiniband/core/rdma_core.c
> > @@ -817,6 +817,7 @@ static void ufile_destroy_ucontext(struct
> ib_uverbs_file *ufile,
> >  	rdma_restrack_del(&ucontext->res);
> >
> >  	ib_dev->ops.dealloc_ucontext(ucontext);
> > +	rdma_user_mmap_entries_remove_free(ucontext);
> 
> This should happen before dealloc_ucontext.
Right, will fix.
> 
> > +struct rdma_user_mmap_entry *
> > +rdma_user_mmap_entry_get(struct ib_ucontext *ucontext, u64 key, u64
> > +len) {
> > +	struct rdma_user_mmap_entry *entry;
> > +	u64 mmap_page;
> > +
> > +	mmap_page = key >> PAGE_SHIFT;
> > +	if (mmap_page > U32_MAX)
> > +		return NULL;
> > +
> > +	entry = xa_load(&ucontext->mmap_xa, mmap_page);
> > +	if (!entry || rdma_user_mmap_get_key(entry) != key ||
> 
> I wonder if the 'rdma_user_mmap_get_key(entry) != key' check is still
> needed.
I guess it's not since the key is used to get the entry. I'll remove the check

> 
> > +/*
> > + * This is only called when the ucontext is destroyed and there can
> > +be no
> > + * concurrent query via mmap or allocate on the xarray, thus we can
> > +be sure no
> > + * other thread is using the entry pointer. We also know that all the
> > +BAR
> > + * pages have either been zap'd or munmaped at this point.  Normal
> > +pages are
> > + * refcounted and will be freed at the proper time.
> > + */
> > +void rdma_user_mmap_entries_remove_free(struct ib_ucontext
> *ucontext)
> > +{
> > +	struct rdma_user_mmap_entry *entry;
> > +	unsigned long mmap_page;
> > +
> > +	xa_for_each(&ucontext->mmap_xa, mmap_page, entry) {
> > +		xa_erase(&ucontext->mmap_xa, mmap_page);
> > +
> > +		ibdev_dbg(ucontext->device,
> > +			  "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx]
> removed\n",
> > +			  entry->obj, rdma_user_mmap_get_key(entry),
> > +			  entry->address, entry->length);
> > +		if (ucontext->device->ops.mmap_free)
> > +			ucontext->device->ops.mmap_free(entry->address,
> > +							entry->length,
> > +							entry->mmap_flag);
> 
> Pass entry instead?
> 
> > +		kfree(entry);
> > +	}
> > +}
> > +
> >  void uverbs_user_mmap_disassociate(struct ib_uverbs_file *ufile)  {
> >  	struct rdma_umap_priv *priv, *next_priv; diff --git
> > a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
> > 26e9c2594913..54ce3fdae180 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -1425,6 +1425,8 @@ struct ib_ucontext {
> >  	 * Implementation details of the RDMA core, don't use in drivers:
> >  	 */
> >  	struct rdma_restrack_entry res;
> > +	struct xarray mmap_xa;
> > +	u32 mmap_xa_page;
> >  };
> >
> >  struct ib_uobject {
> > @@ -2311,6 +2313,7 @@ struct ib_device_ops {
> >  			      struct ib_udata *udata);
> >  	void (*dealloc_ucontext)(struct ib_ucontext *context);
> >  	int (*mmap)(struct ib_ucontext *context, struct vm_area_struct
> > *vma);
> > +	void (*mmap_free)(u64 address, u64 length, u8 mmap_flag);
> 
> I feel like this callback needs some documentation.
> 
> >  	void (*disassociate_ucontext)(struct ib_ucontext *ibcontext);
> >  	int (*alloc_pd)(struct ib_pd *pd, struct ib_udata *udata);
> >  	void (*dealloc_pd)(struct ib_pd *pd, struct ib_udata *udata); @@
> > -2706,9 +2709,23 @@ void  ib_set_client_data(struct ib_device *device,
> > struct ib_client *client,  void ib_set_device_ops(struct ib_device *device,
> >  		       const struct ib_device_ops *ops);
> >
> > +#define RDMA_USER_MMAP_INVALID U64_MAX struct
> rdma_user_mmap_entry {
> > +	void  *obj;
> 
> I know EFA is the culprit here, but please remove the extra space :).
> 
> > +	u64 address;
> > +	u64 length;
> > +	u32 mmap_page;
> > +	u8 mmap_flag;
> > +};
> > +

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Danil Kipnis @ 2019-07-09  9:55 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-block, linux-rdma, axboe, Christoph Hellwig, Sagi Grimberg,
	bvanassche, jgg, dledford, Roman Pen, gregkh
In-Reply-To: <20190620150337.7847-1-jinpuwang@gmail.com>

Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,

Could you please provide some feedback to the IBNBD driver and the
IBTRS library?
So far we addressed all the requests provided by the community and
continue to maintain our code up-to-date with the upstream kernel
while having an extra compatibility layer for older kernels in our
out-of-tree repository.
I understand that SRP and NVMEoF which are in the kernel already do
provide equivalent functionality for the majority of the use cases.
IBNBD on the other hand is showing higher performance and more
importantly includes the IBTRS - a general purpose library to
establish connections and transport BIO-like read/write sg-lists over
RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
I believe IBNBD does meet the kernel coding standards, it doesn't have
a lot of users, while SRP and NVMEoF are widely accepted. Do you think
it would make sense for us to rework our patchset and try pushing it
for staging tree first, so that we can proof IBNBD is well maintained,
beneficial for the eco-system, find a proper location for it within
block/rdma subsystems? This would make it easier for people to try it
out and would also be a huge step for us in terms of maintenance
effort.
The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
near future). Do you think it would make sense to rename the driver to
RNBD/RTRS?

Thank you,
Best Regards,
Danil

On Thu, Jun 20, 2019 at 5:03 PM Jack Wang <jinpuwang@gmail.com> wrote:
>
> Hi all,
>
> Here is v4 of IBNBD/IBTRS patches, which have minor changes
>
>  Changelog
>  ---------
> v4:
>   o Protocol extended to transport IO priorities
>   o Support for Mellanox ConnectX-4/X-5
>   o Minor sysfs extentions (display access mode on server side)
>   o Bug fixes: cleaning up sysfs folders, race on deallocation of resources
>   o Style fixes
>
> v3:
>   o Sparse fixes:
>      - le32 -> le16 conversion
>      - pcpu and RCU wrong declaration
>      - sysfs: dynamically alloc array of sockaddr structures to reduce
>            size of a stack frame
>
>   o Rename sysfs folder on client and server sides to show source and
>     destination addresses of the connection, i.e.:
>            .../<session-name>/paths/<src@dst>/
>
>   o Remove external inclusions from Makefiles.
>   * https://lwn.net/Articles/756994/
>
> v2:
>   o IBNBD:
>      - No legacy request IO mode, only MQ is left.
>
>   o IBTRS:
>      - No FMR registration, only FR is left.
>
>   * https://lwn.net/Articles/755075/
>
> v1:
>   - IBTRS: load-balancing and IO fail-over using multipath features were added.
>
>   - Major parts of the code were rewritten, simplified and overall code
>     size was reduced by a quarter.
>
>   * https://lwn.net/Articles/746342/
>
> v0:
>   - Initial submission
>
>   * https://lwn.net/Articles/718181/
>
>
>  Introduction
>  -------------
>
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connection between client and server
> machines via RDMA. It is based on RDMA-CM, so expect also to support RoCE
> and iWARP, but we mainly tested in IB environment. It is optimized to
> transfer (read/write) IO blocks in the sense that it follows the BIO
> semantics of providing the possibility to either write data from a
> scatter-gather list to the remote side or to request ("read") data
> transfer from the remote side into a given set of buffers.
>
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality, i.e. in IBTRS terminology, an IBTRS path is a set of RDMA
> CMs and particular path is selected according to the load-balancing policy.
> It can be used for other components not bind to IBNBD.
>
>
> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
>
>
>    - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>      thus internal protocol is simple.
>    - IBTRS was developed as an independent RDMA transport library, which
>      supports fail-over and load-balancing policies using multipath, thus
>      it can be used for any other IO needs rather than only for block
>      device.
>    - IBNBD/IBTRS is fast.
>      Old comparison results:
>      https://www.spinics.net/lists/linux-rdma/msg48799.html
>      New comparison results: see performance measurements section below.
>
> Key features of IBTRS transport library and IBNBD block device:
>
> o High throughput and low latency due to:
>    - Only two RDMA messages per IO.
>    - IMM InfiniBand messages on responses to reduce round trip latency.
>    - Simplified memory management: memory allocation happens once on
>      server side when IBTRS session is established.
>
> o IO fail-over and load-balancing by using multipath.  According to
>   our test loads additional path brings ~20% of bandwidth.
>
> o Simple configuration of IBNBD:
>    - Server side is completely passive: volumes do not need to be
>      explicitly exported.
>    - Only IB port GID and device path needed on client side to map
>      a block device.
>    - A device is remapped automatically i.e. after storage reboot.
>
> Commits for kernel can be found here:
>    https://github.com/ionos-enterprise/ibnbd/tree/linux-5.2-rc3--ibnbd-v4
> The out-of-tree modules are here:
>    https://github.com/ionos-enterprise/ibnbd
>
> Vault 2017 presentation:
>   https://events.static.linuxfound.org/sites/events/files/slides/IBNBD-Vault-2017.pdf
>
>  Performance measurements
>  ------------------------
>
> o IBNBD and NVMEoRDMA
>
>   Performance results for the v5.2-rc3 kernel
>   link: https://github.com/ionos-enterprise/ibnbd/tree/develop/performance/v4-v5.2-rc3
>
> Roman Pen (25):
>   sysfs: export sysfs_remove_file_self()
>   ibtrs: public interface header to establish RDMA connections
>   ibtrs: private headers with IBTRS protocol structs and helpers
>   ibtrs: core: lib functions shared between client and server modules
>   ibtrs: client: private header with client structs and functions
>   ibtrs: client: main functionality
>   ibtrs: client: statistics functions
>   ibtrs: client: sysfs interface functions
>   ibtrs: server: private header with server structs and functions
>   ibtrs: server: main functionality
>   ibtrs: server: statistics functions
>   ibtrs: server: sysfs interface functions
>   ibtrs: include client and server modules into kernel compilation
>   ibtrs: a bit of documentation
>   ibnbd: private headers with IBNBD protocol structs and helpers
>   ibnbd: client: private header with client structs and functions
>   ibnbd: client: main functionality
>   ibnbd: client: sysfs interface functions
>   ibnbd: server: private header with server structs and functions
>   ibnbd: server: main functionality
>   ibnbd: server: functionality for IO submission to file or block dev
>   ibnbd: server: sysfs interface functions
>   ibnbd: include client and server modules into kernel compilation
>   ibnbd: a bit of documentation
>   MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
>
>  MAINTAINERS                                   |   14 +
>  drivers/block/Kconfig                         |    2 +
>  drivers/block/Makefile                        |    1 +
>  drivers/block/ibnbd/Kconfig                   |   24 +
>  drivers/block/ibnbd/Makefile                  |   13 +
>  drivers/block/ibnbd/README                    |  315 ++
>  drivers/block/ibnbd/ibnbd-clt-sysfs.c         |  691 ++++
>  drivers/block/ibnbd/ibnbd-clt.c               | 1832 +++++++++++
>  drivers/block/ibnbd/ibnbd-clt.h               |  166 +
>  drivers/block/ibnbd/ibnbd-log.h               |   59 +
>  drivers/block/ibnbd/ibnbd-proto.h             |  378 +++
>  drivers/block/ibnbd/ibnbd-srv-dev.c           |  408 +++
>  drivers/block/ibnbd/ibnbd-srv-dev.h           |  143 +
>  drivers/block/ibnbd/ibnbd-srv-sysfs.c         |  270 ++
>  drivers/block/ibnbd/ibnbd-srv.c               |  945 ++++++
>  drivers/block/ibnbd/ibnbd-srv.h               |   94 +
>  drivers/infiniband/Kconfig                    |    1 +
>  drivers/infiniband/ulp/Makefile               |    1 +
>  drivers/infiniband/ulp/ibtrs/Kconfig          |   22 +
>  drivers/infiniband/ulp/ibtrs/Makefile         |   15 +
>  drivers/infiniband/ulp/ibtrs/README           |  385 +++
>  .../infiniband/ulp/ibtrs/ibtrs-clt-stats.c    |  447 +++
>  .../infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c    |  514 +++
>  drivers/infiniband/ulp/ibtrs/ibtrs-clt.c      | 2844 +++++++++++++++++
>  drivers/infiniband/ulp/ibtrs/ibtrs-clt.h      |  308 ++
>  drivers/infiniband/ulp/ibtrs/ibtrs-log.h      |   84 +
>  drivers/infiniband/ulp/ibtrs/ibtrs-pri.h      |  463 +++
>  .../infiniband/ulp/ibtrs/ibtrs-srv-stats.c    |  103 +
>  .../infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c    |  303 ++
>  drivers/infiniband/ulp/ibtrs/ibtrs-srv.c      | 1998 ++++++++++++
>  drivers/infiniband/ulp/ibtrs/ibtrs-srv.h      |  170 +
>  drivers/infiniband/ulp/ibtrs/ibtrs.c          |  610 ++++
>  drivers/infiniband/ulp/ibtrs/ibtrs.h          |  318 ++
>  fs/sysfs/file.c                               |    1 +
>  34 files changed, 13942 insertions(+)
>  create mode 100644 drivers/block/ibnbd/Kconfig
>  create mode 100644 drivers/block/ibnbd/Makefile
>  create mode 100644 drivers/block/ibnbd/README
>  create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-clt.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-clt.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-log.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-proto.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>  create mode 100644 drivers/infiniband/ulp/ibtrs/README
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h
>
> --
> 2.17.1
>

^ permalink raw reply

* [RFC, BUG] ibverbs/rxe: Potential use after free
From: Maksym Planeta @ 2019-07-09  9:53 UTC (permalink / raw)
  To: Moni Shoua, Doug Ledford, Jason Gunthorpe, linux-rdma; +Cc: Maksym Planeta

Hello All,

SoftRoCE driver (drivers/infiniband/sw/rxe) contains several tasklets: 
rxe_requester (in rxe_req.c), rxe_completer (rxe_comp.c), and 
rxe_responder (in rxe_resp.c).

All of these tasklets can be scheduled for asynchronous execution by 
calling function rxe_run_task (rxe_task.c) with sched=1. In this case 
the tasklet is passed to tasklet_schedule.

These tasklets receive as a parameter a queue pair (QP) object that is 
allocated from a dedicated memory pool implemented in rxe_pool.c and is 
reference counted.

The first thing this tasklets do is increasing reference count for a QP 
object by calling rxe_add_ref (call kref_get).

I thin this behavior is erroneous. It is possible for reference counter 
to go down to zero, when a tasklet is scheduled for execution. This can 
happen, for example, if a tasklet is scheduled and then the process that 
created a QP is destroyed before the tasklet runs. When a process is 
destroyed, the reference count for QP will go down and can reach zero.

As result, the QP object can be already destroyed by the time the 
tasklet is actually run. This will result into use-after-free.

The proper way to implement this case would be to call rxe_add_ref 
*before* tasklet_schedule: Either inside rxe_run_task, when sched equals 
1, or before rxe_run_task.

If you think my reasoning is correct, I can prepare a patch.

I would be happy to see any comment from you before that, as I'm new to 
kernel programming.

-- 
Regards,
Maksym Planeta

^ permalink raw reply

* Re: [PATCH v5 rdma-next 2/6] RDMA/efa: Use the common mmap_xa helpers
From: Gal Pressman @ 2019-07-09  9:02 UTC (permalink / raw)
  To: Michal Kalderon, ariel.elior, jgg, dledford, galpress
  Cc: linux-rdma, davem, netdev
In-Reply-To: <20190708091503.14723-3-michal.kalderon@marvell.com>

On 08/07/2019 12:14, Michal Kalderon wrote:

Hi, a few nits:

> Remove the functions related to managing the mmap_xa database.
> This code was copied to the ib_core. Use the common API's instead.
> 
> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
> ---
>  drivers/infiniband/hw/efa/efa.h       |   3 +-
>  drivers/infiniband/hw/efa/efa_main.c  |   1 +
>  drivers/infiniband/hw/efa/efa_verbs.c | 183 ++++++++--------------------------
>  3 files changed, 42 insertions(+), 145 deletions(-)
> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> index df77bc312a25..5dff892da161 100644
> --- a/drivers/infiniband/hw/efa/efa_verbs.c
> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> @@ -13,34 +13,15 @@
>  
>  #include "efa.h"
>  
> -#define EFA_MMAP_FLAG_SHIFT 56
> -#define EFA_MMAP_PAGE_MASK GENMASK(EFA_MMAP_FLAG_SHIFT - 1, 0)
> -#define EFA_MMAP_INVALID U64_MAX
> -

Don't delete the blank line please.

>  enum {
>  	EFA_MMAP_DMA_PAGE = 0,
>  	EFA_MMAP_IO_WC,
>  	EFA_MMAP_IO_NC,
>  };
> -
>  #define EFA_AENQ_ENABLED_GROUPS \
>  	(BIT(EFA_ADMIN_FATAL_ERROR) | BIT(EFA_ADMIN_WARNING) | \
>  	 BIT(EFA_ADMIN_NOTIFICATION) | BIT(EFA_ADMIN_KEEP_ALIVE))
>  
> -struct efa_mmap_entry {
> -	void  *obj;
> -	u64 address;
> -	u64 length;
> -	u32 mmap_page;
> -	u8 mmap_flag;
> -};
> -
> -static inline u64 get_mmap_key(const struct efa_mmap_entry *efa)
> -{
> -	return ((u64)efa->mmap_flag << EFA_MMAP_FLAG_SHIFT) |
> -	       ((u64)efa->mmap_page << PAGE_SHIFT);
> -}
> -
>  #define EFA_CHUNK_PAYLOAD_SHIFT       12
>  #define EFA_CHUNK_PAYLOAD_SIZE        BIT(EFA_CHUNK_PAYLOAD_SHIFT)
>  #define EFA_CHUNK_PAYLOAD_PTR_SIZE    8
> @@ -145,105 +126,7 @@ static void *efa_zalloc_mapped(struct efa_dev *dev, dma_addr_t *dma_addr,
>  	return addr;
>  }
>  
> -/*
> - * This is only called when the ucontext is destroyed and there can be no
> - * concurrent query via mmap or allocate on the xarray, thus we can be sure no
> - * other thread is using the entry pointer. We also know that all the BAR
> - * pages have either been zap'd or munmaped at this point.  Normal pages are
> - * refcounted and will be freed at the proper time.
> - */
> -static void mmap_entries_remove_free(struct efa_dev *dev,
> -				     struct efa_ucontext *ucontext)
> -{
> -	struct efa_mmap_entry *entry;
> -	unsigned long mmap_page;
>  
> -	xa_for_each(&ucontext->mmap_xa, mmap_page, entry) {
> -		xa_erase(&ucontext->mmap_xa, mmap_page);
> -
> -		ibdev_dbg(
> -			&dev->ibdev,
> -			"mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx] removed\n",
> -			entry->obj, get_mmap_key(entry), entry->address,
> -			entry->length);
> -		if (entry->mmap_flag == EFA_MMAP_DMA_PAGE)
> -			/* DMA mapping is already gone, now free the pages */
> -			free_pages_exact(phys_to_virt(entry->address),
> -					 entry->length);
> -		kfree(entry);
> -	}
> -}
> -
> -static struct efa_mmap_entry *mmap_entry_get(struct efa_dev *dev,
> -					     struct efa_ucontext *ucontext,
> -					     u64 key, u64 len)
> -{
> -	struct efa_mmap_entry *entry;
> -	u64 mmap_page;
> -
> -	mmap_page = (key & EFA_MMAP_PAGE_MASK) >> PAGE_SHIFT;
> -	if (mmap_page > U32_MAX)
> -		return NULL;
> -
> -	entry = xa_load(&ucontext->mmap_xa, mmap_page);
> -	if (!entry || get_mmap_key(entry) != key || entry->length != len)
> -		return NULL;
> -
> -	ibdev_dbg(&dev->ibdev,
> -		  "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx] removed\n",
> -		  entry->obj, key, entry->address, entry->length);
> -
> -	return entry;
> -}
> -
> -/*
> - * Note this locking scheme cannot support removal of entries, except during
> - * ucontext destruction when the core code guarentees no concurrency.
> - */
> -static u64 mmap_entry_insert(struct efa_dev *dev, struct efa_ucontext *ucontext,
> -			     void *obj, u64 address, u64 length, u8 mmap_flag)
> -{
> -	struct efa_mmap_entry *entry;
> -	u32 next_mmap_page;
> -	int err;
> -
> -	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> -	if (!entry)
> -		return EFA_MMAP_INVALID;
> -
> -	entry->obj = obj;
> -	entry->address = address;
> -	entry->length = length;
> -	entry->mmap_flag = mmap_flag;
> -
> -	xa_lock(&ucontext->mmap_xa);
> -	if (check_add_overflow(ucontext->mmap_xa_page,
> -			       (u32)(length >> PAGE_SHIFT),
> -			       &next_mmap_page))
> -		goto err_unlock;
> -
> -	entry->mmap_page = ucontext->mmap_xa_page;
> -	ucontext->mmap_xa_page = next_mmap_page;
> -	err = __xa_insert(&ucontext->mmap_xa, entry->mmap_page, entry,
> -			  GFP_KERNEL);
> -	if (err)
> -		goto err_unlock;
> -
> -	xa_unlock(&ucontext->mmap_xa);
> -
> -	ibdev_dbg(
> -		&dev->ibdev,
> -		"mmap: obj[0x%p] addr[%#llx], len[%#llx], key[%#llx] inserted\n",
> -		entry->obj, entry->address, entry->length, get_mmap_key(entry));
> -
> -	return get_mmap_key(entry);
> -
> -err_unlock:
> -	xa_unlock(&ucontext->mmap_xa);
> -	kfree(entry);
> -	return EFA_MMAP_INVALID;
> -
> -}
>  

You left two extra blank lines between efa_zalloc_mapped and efa_query_device.

>  int efa_query_device(struct ib_device *ibdev,
>  		     struct ib_device_attr *props,
> @@ -488,45 +371,52 @@ static int qp_mmap_entries_setup(struct efa_qp *qp,
>  				 struct efa_com_create_qp_params *params,
>  				 struct efa_ibv_create_qp_resp *resp)
>  {
> +	u64 address;
> +	u64 length;

Line break.

>  	/*
>  	 * Once an entry is inserted it might be mmapped, hence cannot be
>  	 * cleaned up until dealloc_ucontext.
>  	 */
>  	resp->sq_db_mmap_key =
> -		mmap_entry_insert(dev, ucontext, qp,
> -				  dev->db_bar_addr + resp->sq_db_offset,
> -				  PAGE_SIZE, EFA_MMAP_IO_NC);
> -	if (resp->sq_db_mmap_key == EFA_MMAP_INVALID)
> +		rdma_user_mmap_entry_insert(&ucontext->ibucontext, qp,
> +					    dev->db_bar_addr +
> +					    resp->sq_db_offset,
> +					    PAGE_SIZE, EFA_MMAP_IO_NC);
> +	if (resp->sq_db_mmap_key == RDMA_USER_MMAP_INVALID)
>  		return -ENOMEM;
>  
>  	resp->sq_db_offset &= ~PAGE_MASK;
>  
> +	address = dev->mem_bar_addr + resp->llq_desc_offset;
> +	length = PAGE_ALIGN(params->sq_ring_size_in_bytes +
> +			    (resp->llq_desc_offset & ~PAGE_MASK));
>  	resp->llq_desc_mmap_key =
> -		mmap_entry_insert(dev, ucontext, qp,
> -				  dev->mem_bar_addr + resp->llq_desc_offset,
> -				  PAGE_ALIGN(params->sq_ring_size_in_bytes +
> -					     (resp->llq_desc_offset & ~PAGE_MASK)),
> -				  EFA_MMAP_IO_WC);
> -	if (resp->llq_desc_mmap_key == EFA_MMAP_INVALID)
> +		rdma_user_mmap_entry_insert(&ucontext->ibucontext, qp,
> +					    address,
> +					    length,
> +					    EFA_MMAP_IO_WC);
> +	if (resp->llq_desc_mmap_key == RDMA_USER_MMAP_INVALID)
>  		return -ENOMEM;
>  
>  	resp->llq_desc_offset &= ~PAGE_MASK;
>  
>  	if (qp->rq_size) {
> +		address = dev->db_bar_addr + resp->rq_db_offset;
>  		resp->rq_db_mmap_key =
> -			mmap_entry_insert(dev, ucontext, qp,
> -					  dev->db_bar_addr + resp->rq_db_offset,
> -					  PAGE_SIZE, EFA_MMAP_IO_NC);
> -		if (resp->rq_db_mmap_key == EFA_MMAP_INVALID)
> +			rdma_user_mmap_entry_insert(&ucontext->ibucontext, qp,
> +						    address, PAGE_SIZE,
> +						    EFA_MMAP_IO_NC);
> +		if (resp->rq_db_mmap_key == RDMA_USER_MMAP_INVALID)
>  			return -ENOMEM;
>  
>  		resp->rq_db_offset &= ~PAGE_MASK;
>  
> +		address = virt_to_phys(qp->rq_cpu_addr);
>  		resp->rq_mmap_key =
> -			mmap_entry_insert(dev, ucontext, qp,
> -					  virt_to_phys(qp->rq_cpu_addr),
> -					  qp->rq_size, EFA_MMAP_DMA_PAGE);
> -		if (resp->rq_mmap_key == EFA_MMAP_INVALID)
> +			rdma_user_mmap_entry_insert(&ucontext->ibucontext, qp,
> +						    address, qp->rq_size,
> +						    EFA_MMAP_DMA_PAGE);
> +		if (resp->rq_mmap_key == RDMA_USER_MMAP_INVALID)
>  			return -ENOMEM;
>  
>  		resp->rq_mmap_size = qp->rq_size;
> @@ -875,11 +765,13 @@ void efa_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
>  static int cq_mmap_entries_setup(struct efa_dev *dev, struct efa_cq *cq,
>  				 struct efa_ibv_create_cq_resp *resp)
>  {
> +	struct efa_ucontext *ucontext = cq->ucontext;

Line break.

>  	resp->q_mmap_size = cq->size;
> -	resp->q_mmap_key = mmap_entry_insert(dev, cq->ucontext, cq,
> -					     virt_to_phys(cq->cpu_addr),
> -					     cq->size, EFA_MMAP_DMA_PAGE);
> -	if (resp->q_mmap_key == EFA_MMAP_INVALID)
> +	resp->q_mmap_key =
> +		rdma_user_mmap_entry_insert(&ucontext->ibucontext, cq,
> +					    virt_to_phys(cq->cpu_addr),
> +					    cq->size, EFA_MMAP_DMA_PAGE);
> +	if (resp->q_mmap_key == RDMA_USER_MMAP_INVALID)
>  		return -ENOMEM;
>  
>  	return 0;
> @@ -1531,7 +1423,6 @@ int efa_alloc_ucontext(struct ib_ucontext *ibucontext, struct ib_udata *udata)
>  		goto err_out;
>  
>  	ucontext->uarn = result.uarn;
> -	xa_init(&ucontext->mmap_xa);
>  
>  	resp.cmds_supp_udata_mask |= EFA_USER_CMDS_SUPP_UDATA_QUERY_DEVICE;
>  	resp.cmds_supp_udata_mask |= EFA_USER_CMDS_SUPP_UDATA_CREATE_AH;
> @@ -1560,19 +1451,25 @@ void efa_dealloc_ucontext(struct ib_ucontext *ibucontext)
>  	struct efa_ucontext *ucontext = to_eucontext(ibucontext);
>  	struct efa_dev *dev = to_edev(ibucontext->device);
>  
> -	mmap_entries_remove_free(dev, ucontext);
>  	efa_dealloc_uar(dev, ucontext->uarn);
>  }
>  
> +void efa_mmap_free(u64 address, u64 length, u8 mmap_flag)
> +{
> +	/* DMA mapping is already gone, now free the pages */
> +	if (mmap_flag == EFA_MMAP_DMA_PAGE)
> +		free_pages_exact(phys_to_virt(address), length);
> +}
> +
>  static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>  		      struct vm_area_struct *vma, u64 key, u64 length)
>  {
> -	struct efa_mmap_entry *entry;
> +	struct rdma_user_mmap_entry *entry;
>  	unsigned long va;
>  	u64 pfn;
>  	int err;
>  
> -	entry = mmap_entry_get(dev, ucontext, key, length);
> +	entry = rdma_user_mmap_entry_get(&ucontext->ibucontext, key, length);
>  	if (!entry) {
>  		ibdev_dbg(&dev->ibdev, "key[%#llx] does not have valid entry\n",
>  			  key);
> 

^ permalink raw reply

* Re: [PATCH v5 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Leon Romanovsky @ 2019-07-09  7:02 UTC (permalink / raw)
  To: Michal Kalderon
  Cc: ariel.elior, jgg, dledford, galpress, linux-rdma, davem, netdev
In-Reply-To: <20190708091503.14723-2-michal.kalderon@marvell.com>

On Mon, Jul 08, 2019 at 12:14:58PM +0300, Michal Kalderon wrote:
> Create some common API's for adding entries to a xa_mmap.
> Searching for an entry and freeing one.
>
> The code was copied from the efa driver almost as is, just renamed
> function to be generic and not efa specific.
>
> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
> ---
>  drivers/infiniband/core/device.c      |   1 +
>  drivers/infiniband/core/rdma_core.c   |   1 +
>  drivers/infiniband/core/uverbs_cmd.c  |   1 +
>  drivers/infiniband/core/uverbs_main.c | 105 ++++++++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h               |  32 +++++++++++
>  5 files changed, 140 insertions(+)
>
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 8a6ccb936dfe..a830c2c5d691 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -2521,6 +2521,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
>  	SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
>  	SET_DEVICE_OP(dev_ops, map_phys_fmr);
>  	SET_DEVICE_OP(dev_ops, mmap);
> +	SET_DEVICE_OP(dev_ops, mmap_free);
>  	SET_DEVICE_OP(dev_ops, modify_ah);
>  	SET_DEVICE_OP(dev_ops, modify_cq);
>  	SET_DEVICE_OP(dev_ops, modify_device);
> diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
> index ccf4d069c25c..7166741834c8 100644
> --- a/drivers/infiniband/core/rdma_core.c
> +++ b/drivers/infiniband/core/rdma_core.c
> @@ -817,6 +817,7 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
>  	rdma_restrack_del(&ucontext->res);
>
>  	ib_dev->ops.dealloc_ucontext(ucontext);
> +	rdma_user_mmap_entries_remove_free(ucontext);
>  	kfree(ucontext);
>
>  	ufile->ucontext = NULL;
> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> index 7ddd0e5bc6b3..44c0600245e4 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -254,6 +254,7 @@ static int ib_uverbs_get_context(struct uverbs_attr_bundle *attrs)
>
>  	mutex_init(&ucontext->per_mm_list_lock);
>  	INIT_LIST_HEAD(&ucontext->per_mm_list);
> +	xa_init(&ucontext->mmap_xa);
>
>  	ret = get_unused_fd_flags(O_CLOEXEC);
>  	if (ret < 0)
> diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> index 11c13c1381cf..37507cc27e8c 100644
> --- a/drivers/infiniband/core/uverbs_main.c
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -965,6 +965,111 @@ int rdma_user_mmap_io(struct ib_ucontext *ucontext, struct vm_area_struct *vma,
>  }
>  EXPORT_SYMBOL(rdma_user_mmap_io);
>
> +static inline u64
> +rdma_user_mmap_get_key(const struct rdma_user_mmap_entry *entry)
> +{
> +	return (u64)entry->mmap_page << PAGE_SHIFT;
> +}
> +
> +struct rdma_user_mmap_entry *
> +rdma_user_mmap_entry_get(struct ib_ucontext *ucontext, u64 key, u64 len)
> +{
> +	struct rdma_user_mmap_entry *entry;
> +	u64 mmap_page;
> +
> +	mmap_page = key >> PAGE_SHIFT;
> +	if (mmap_page > U32_MAX)
> +		return NULL;
> +
> +	entry = xa_load(&ucontext->mmap_xa, mmap_page);
> +	if (!entry || rdma_user_mmap_get_key(entry) != key ||
> +	    entry->length != len)
> +		return NULL;
> +
> +	ibdev_dbg(ucontext->device,
> +		  "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx] removed\n",
> +		  entry->obj, key, entry->address, entry->length);
> +
> +	return entry;
> +}
> +EXPORT_SYMBOL(rdma_user_mmap_entry_get);

Please add function description in kernel doc format for all newly EXPORT_SYMBOL()
functions you introduced in RDMA/core.

> +
> +/*
> + * Note this locking scheme cannot support removal of entries, except during
> + * ucontext destruction when the core code guarentees no concurrency.
> + */
> +u64 rdma_user_mmap_entry_insert(struct ib_ucontext *ucontext, void *obj,
> +				u64 address, u64 length, u8 mmap_flag)
> +{
> +	struct rdma_user_mmap_entry *entry;
> +	u32 next_mmap_page;
> +	int err;
> +
> +	entry = kmalloc(sizeof(*entry), GFP_KERNEL);

It is worth to use kzalloc and not kmalloc.

Thanks

^ permalink raw reply

* RE: [PATCH rdma-next 2/2] IB: Support netlink commands in non init_net net namespaces
From: Parav Pandit @ 2019-07-09  6:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Steve Wise
In-Reply-To: <20190708202023.GA8020@ziepe.ca>



> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, July 9, 2019 1:50 AM
> To: Leon Romanovsky <leon@kernel.org>
> Cc: Doug Ledford <dledford@redhat.com>; Leon Romanovsky
> <leonro@mellanox.com>; RDMA mailing list <linux-rdma@vger.kernel.org>;
> Parav Pandit <parav@mellanox.com>; Steve Wise
> <swise@opengridcomputing.com>
> Subject: Re: [PATCH rdma-next 2/2] IB: Support netlink commands in non
> init_net net namespaces
> 
> On Thu, Jul 04, 2019 at 04:04:02PM +0300, Leon Romanovsky wrote:
> > -int rdma_nl_unicast(struct sk_buff *skb, u32 pid)
> > +int rdma_nl_unicast(struct net *net, struct sk_buff *skb, u32 pid)
> >  {
> > +	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
> 
> This should be a proper type safe accessor in all places
> 
Adding it.

> > -void rdma_nl_exit(void)
> > +void rdma_nl_net_exit(struct rdma_dev_net *rnet)
> >  {
> > -	int idx;
> > -
> > -	for (idx = 0; idx < RDMA_NL_NUM_CLIENTS; idx++)
> > -		rdma_nl_unregister(idx);
> 
> There should be a WARN_ON during the module unload that no NL clients are
> still registered
> 
Adding it.

> Thanks,
> Jason

^ permalink raw reply

* Re: [PATCH rdma-next 2/2] IB: Support netlink commands in non init_net net namespaces
From: Leon Romanovsky @ 2019-07-09  6:38 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, RDMA mailing list, Parav Pandit, Steve Wise
In-Reply-To: <20190708202023.GA8020@ziepe.ca>

On Mon, Jul 08, 2019 at 05:20:23PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 04, 2019 at 04:04:02PM +0300, Leon Romanovsky wrote:
> > -int rdma_nl_unicast(struct sk_buff *skb, u32 pid)
> > +int rdma_nl_unicast(struct net *net, struct sk_buff *skb, u32 pid)
> >  {
> > +	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
>
> This should be a proper type safe accessor in all places

"const struct net *net" and not "struct net *net"?

>
> > -void rdma_nl_exit(void)
> > +void rdma_nl_net_exit(struct rdma_dev_net *rnet)
> >  {
> > -	int idx;
> > -
> > -	for (idx = 0; idx < RDMA_NL_NUM_CLIENTS; idx++)
> > -		rdma_nl_unregister(idx);
>
> There should be a WARN_ON during the module unload that no NL clients
> are still registered

IMHO, the usage of WARN_ON is overrated.

>
> Thanks,
> Jason

^ permalink raw reply

* RE: [PATCH rdma-next] RDMA/core: Annotate destroy of mutex to ensure that it is released as unlocked
From: Parav Pandit @ 2019-07-09  5:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: Doug Ledford, RDMA mailing list, Leon Romanovsky
In-Reply-To: <20190708200114.GA25699@ziepe.ca>



> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, July 9, 2019 1:31 AM
> To: Leon Romanovsky <leon@kernel.org>
> Cc: Doug Ledford <dledford@redhat.com>; Parav Pandit
> <parav@mellanox.com>; RDMA mailing list <linux-rdma@vger.kernel.org>;
> Leon Romanovsky <leonro@mellanox.com>
> Subject: Re: [PATCH rdma-next] RDMA/core: Annotate destroy of mutex to
> ensure that it is released as unlocked
> 
> On Thu, Jul 04, 2019 at 04:00:12PM +0300, Leon Romanovsky wrote:
> > From: Parav Pandit <parav@mellanox.com>
> >
> > While compiled with CONFIG_DEBUG_MUTEXES, the kernel ensures that
> > mutex is not held during destroy.
> > Hence add mutex_destroy() for mutexes used in RDMA modules.
> >
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> >  drivers/infiniband/core/cache.c        | 1 +
> >  drivers/infiniband/core/cma_configfs.c | 1 +
> >  drivers/infiniband/core/device.c       | 3 +++
> >  drivers/infiniband/core/user_mad.c     | 2 +-
> >  drivers/infiniband/core/uverbs_main.c  | 2 ++
> >  drivers/infiniband/core/verbs.c        | 1 +
> >  6 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/core/cache.c
> > b/drivers/infiniband/core/cache.c index 18e476b3ced0..00fb3eacda19
> > 100644
> > +++ b/drivers/infiniband/core/cache.c
> > @@ -810,6 +810,7 @@ static void release_gid_table(struct ib_device
> *device,
> >  	if (leak)
> >  		return;
> >
> > +	mutex_destroy(&table->lock);
> >  	kfree(table->data_vec);
> >  	kfree(table);
> >  }
> > diff --git a/drivers/infiniband/core/cma_configfs.c
> > b/drivers/infiniband/core/cma_configfs.c
> > index 3ec2c415bb70..0a7b5eba2fc0 100644
> > +++ b/drivers/infiniband/core/cma_configfs.c
> > @@ -350,4 +350,5 @@ int __init cma_configfs_init(void)  void __exit
> > cma_configfs_exit(void)  {
> >  	configfs_unregister_subsystem(&cma_subsys);
> > +	mutex_destroy(&cma_subsys.su_mutex);
> >  }
> 
> There is a missing mutex_destroy in cma_configfs_init's error path.
> 
Ack. Adding it.

> > diff --git a/drivers/infiniband/core/device.c
> > b/drivers/infiniband/core/device.c
> > index 7f4affe8a10d..adf8d93bb42d 100644
> > +++ b/drivers/infiniband/core/device.c
> > @@ -508,6 +508,9 @@ static void ib_device_release(struct device *device)
> >  			  rcu_head);
> >  	}
> >
> > +	mutex_destroy(&dev->unregistration_lock);
> > +	mutex_destroy(&dev->compat_devs_mutex);
> > +
> >  	xa_destroy(&dev->compat_devs);
> >  	xa_destroy(&dev->client_data);
> >  	kfree_rcu(dev, rcu_head);
> > diff --git a/drivers/infiniband/core/user_mad.c
> > b/drivers/infiniband/core/user_mad.c
> > index 9f8a48016b41..e0512aef033c 100644
> > +++ b/drivers/infiniband/core/user_mad.c
> > @@ -1038,7 +1038,7 @@ static int ib_umad_close(struct inode *inode, struct
> file *filp)
> >  				ib_unregister_mad_agent(file->agent[i]);
> >
> >  	mutex_unlock(&file->port->file_mutex);
> > -
> > +	mutex_destroy(&file->mutex);
> >  	kfree(file);
> 
> The file->port->file_mutex is missing a destroy in ib_umad_dev_free (bit tricky
> to do)
> 
I will leave this for now and address in second iteration.

> >  	return 0;
> >  }
> > diff --git a/drivers/infiniband/core/uverbs_main.c
> > b/drivers/infiniband/core/uverbs_main.c
> > index 11c13c1381cf..4827aa3415ff 100644
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -120,6 +120,8 @@ static void ib_uverbs_release_dev(struct device
> > *device)
> >
> >  	uverbs_destroy_api(dev->uapi);
> >  	cleanup_srcu_struct(&dev->disassociate_srcu);
> > +	mutex_destroy(&dev->lists_mutex);
> > +	mutex_destroy(&dev->xrcd_tree_mutex);
> 
> This file also has ucontext_lock and umap_lock that are missing destroy
> 
Ack. Adding it.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox