From: Jason Gunthorpe <jgg@nvidia.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>,
	Adit Ranadive <aditr@vmware.com>,
	Ariel Elior <aelior@marvell.com>,
	Bernard Metzler <bmt@zurich.ibm.com>,
	Dennis Dalessandro <dennis.dalessandro@intel.com>,
	Devesh Sharma <devesh.sharma@broadcom.com>,
	Lijun Ou <oulijun@huawei.com>, <linux-rdma@vger.kernel.org>,
	Michal Kalderon <mkalderon@marvell.com>,
	"Mike Marciniszyn" <mike.marciniszyn@intel.com>,
	Naresh Kumar PBS <nareshkumar.pbs@broadcom.com>,
	Potnuri Bharat Teja <bharat@chelsio.com>,
	Selvin Xavier <selvin.xavier@broadcom.com>,
	Somnath Kotur <somnath.kotur@broadcom.com>,
	Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>,
	VMware PV-Drivers <pv-drivers@vmware.com>,
	Weihang Li <liweihang@huawei.com>,
	"Wei Hu(Xavier)" <huwei87@hisilicon.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Zhu Yanjun <yanjunz@nvidia.com>
Subject: Re: [PATCH rdma-next v1 05/10] RDMA: Restore ability to fail on SRQ destroy
Date: Thu, 3 Sep 2020 10:12:01 -0300
Message-ID: <20200903131201.GD1152540@nvidia.com>
In-Reply-To: <20200903122256.GA1152540@nvidia.com>

On Thu, Sep 03, 2020 at 09:22:56AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 03, 2020 at 08:28:26AM +0300, Leon Romanovsky wrote:
> > On Wed, Sep 02, 2020 at 09:18:27PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Aug 30, 2020 at 11:40:05AM +0300, Leon Romanovsky wrote:
> > >
> > > > -void mlx5_ib_destroy_srq(struct ib_srq *srq, struct ib_udata *udata)
> > > > +int mlx5_ib_destroy_srq(struct ib_srq *srq, struct ib_udata *udata)
> > > >  {
> > > >  	struct mlx5_ib_dev *dev = to_mdev(srq->device);
> > > >  	struct mlx5_ib_srq *msrq = to_msrq(srq);
> > > > +	int ret;
> > > > +
> > > > +	ret = mlx5_cmd_destroy_srq(dev, &msrq->msrq);
> > > > +	if (ret && udata)
> > > > +		return ret;
> > > >
> > > > -	mlx5_cmd_destroy_srq(dev, &msrq->msrq);
> > > > -
> > > > -	if (srq->uobject) {
> > > > -		mlx5_ib_db_unmap_user(
> > > > -			rdma_udata_to_drv_context(
> > > > -				udata,
> > > > -				struct mlx5_ib_ucontext,
> > > > -				ibucontext),
> > > > -			&msrq->db);
> > > > -		ib_umem_release(msrq->umem);
> > > > -	} else {
> > > > -		destroy_srq_kernel(dev, msrq);
> > > > +	if (udata) {
> > > > +		destroy_srq_user(srq->pd, msrq, udata);
> > > > +		return 0;
> > > >  	}
> > > > +
> > > > +	/* We are cleaning kernel resources anyway */
> > > > +	destroy_srq_kernel(dev, msrq);
> > >
> > > Oh, and this isn't right.. If we are going to leak things then we have
> > > to leak anything exposed for DMA as well, eg the fragbuf under the SRQ
> > > has to be leaked.
> > 
> > We are leaking for ULPs only, from their perspective everything was
> > released and WARN_ON() will be the sign of the problem.
> 
> If we are going to add back in error handling, then it needs to be
> done right, there is no difference between kernel and user, everything
> should be leaked.
> 
> > > If the HW can't guarantee it stopped doing DMA then we can't return
> > > memory under potentially active DMA back to the system.
> > 
> > ULPs are supposed to guarantee that all operations stopped.
> 
> ULP should never trigger this, only broken HW can cause this kind of
> problem.
> 
> > I don't know, all those years we relied on the ULPs to do destroy
> > properly and it worked well. I didn't hear any complaint from the field
> > that HW destroy failed in proper ULP flow.
> > 
> > It looks like over-engineering to me.
> 
> Given mlx5 already has the fatal error handling it seems a reasonable
> way to re-introduce the error code without just declaring drivers are
> buggy to use it..

That said, it doesn't have to be done in this series. However, let's
just keep the basic principle that if destroy fails then the HW object
remains untouched and fully operational. None of this "partial destroy,
but only in kernel mode" stuff.
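
To make that principle concrete, here is a minimal user-space sketch.
It is not mlx5 code: struct mock_srq, mock_cmd_destroy_srq(),
mock_create_srq() and mock_destroy_srq() are all hypothetical names
invented for illustration, and a plain calloc() buffer stands in for
the memory exposed for DMA. The only point is the ordering: the HW
command runs first, and on failure nothing else is touched, with no
kernel/user distinction.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver SRQ; not the real mlx5_ib_srq. */
struct mock_srq {
	bool hw_alive;   /* the HW object still exists */
	void *dma_buf;   /* memory the HW may still DMA into */
	bool fw_broken;  /* test knob: make the destroy command fail */
};

/* Hypothetical stand-in for the firmware destroy command. */
int mock_cmd_destroy_srq(struct mock_srq *srq)
{
	if (srq->fw_broken)
		return -EIO;
	srq->hw_alive = false;
	return 0;
}

/* Allocate the buffer and mark the mock HW object as live. */
int mock_create_srq(struct mock_srq *srq, size_t len)
{
	srq->dma_buf = calloc(1, len);
	if (!srq->dma_buf)
		return -ENOMEM;
	srq->hw_alive = true;
	srq->fw_broken = false;
	return 0;
}

/*
 * One rule for every caller: if the HW command fails, return the error
 * and leave the object exactly as it was, so memory that may still be
 * under DMA is never handed back to the allocator.
 */
int mock_destroy_srq(struct mock_srq *srq)
{
	int ret;

	ret = mock_cmd_destroy_srq(srq);
	if (ret)
		return ret;

	/* Only now is it safe to release the backing memory. */
	free(srq->dma_buf);
	srq->dma_buf = NULL;
	return 0;
}
```

With fw_broken set, mock_destroy_srq() returns -EIO and both hw_alive
and dma_buf are unchanged. The caller can retry later or deliberately
leak the whole object, but never ends up with a half-destroyed one.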

Jason
