From: Jason Gunthorpe <jgg@nvidia.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>,
Leon Romanovsky <leonro@mellanox.com>,
Adit Ranadive <aditr@vmware.com>,
Ariel Elior <aelior@marvell.com>,
"Bernard Metzler" <bmt@zurich.ibm.com>,
Dennis Dalessandro <dennis.dalessandro@intel.com>,
Devesh Sharma <devesh.sharma@broadcom.com>,
Lijun Ou <oulijun@huawei.com>, <linux-rdma@vger.kernel.org>,
Michal Kalderon <mkalderon@marvell.com>,
Mike Marciniszyn <mike.marciniszyn@intel.com>,
Naresh Kumar PBS <nareshkumar.pbs@broadcom.com>,
Potnuri Bharat Teja <bharat@chelsio.com>,
Selvin Xavier <selvin.xavier@broadcom.com>,
"Somnath Kotur" <somnath.kotur@broadcom.com>,
Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>,
VMware PV-Drivers <pv-drivers@vmware.com>,
Weihang Li <liweihang@huawei.com>,
"Wei Hu(Xavier)" <huwei87@hisilicon.com>,
Yishai Hadas <yishaih@nvidia.com>,
Zhu Yanjun <yanjunz@nvidia.com>
Subject: Re: [PATCH rdma-next v1 05/10] RDMA: Restore ability to fail on SRQ destroy
Date: Wed, 2 Sep 2020 21:18:27 -0300
Message-ID: <20200903001827.GB1479562@nvidia.com>
In-Reply-To: <20200830084010.102381-6-leon@kernel.org>

On Sun, Aug 30, 2020 at 11:40:05AM +0300, Leon Romanovsky wrote:

> -void mlx5_ib_destroy_srq(struct ib_srq *srq, struct ib_udata *udata)
> +int mlx5_ib_destroy_srq(struct ib_srq *srq, struct ib_udata *udata)
>  {
>      struct mlx5_ib_dev *dev = to_mdev(srq->device);
>      struct mlx5_ib_srq *msrq = to_msrq(srq);
> +    int ret;
> +
> +    ret = mlx5_cmd_destroy_srq(dev, &msrq->msrq);
> +    if (ret && udata)
> +        return ret;
>
> -    mlx5_cmd_destroy_srq(dev, &msrq->msrq);
> -
> -    if (srq->uobject) {
> -        mlx5_ib_db_unmap_user(
> -            rdma_udata_to_drv_context(
> -                udata,
> -                struct mlx5_ib_ucontext,
> -                ibucontext),
> -            &msrq->db);
> -        ib_umem_release(msrq->umem);
> -    } else {
> -        destroy_srq_kernel(dev, msrq);
> +    if (udata) {
> +        destroy_srq_user(srq->pd, msrq, udata);
> +        return 0;
>      }
> +
> +    /* We are cleaning kernel resources anyway */
> +    destroy_srq_kernel(dev, msrq);

Oh, and this isn't right.. If we are going to leak things then we have
to leak anything exposed for DMA as well, e.g. the fragbuf under the SRQ
has to be leaked.

If the HW can't guarantee it stopped doing DMA then we can't return
memory under potentially active DMA back to the system.
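
To make the rule concrete, here is a minimal userspace sketch, not mlx5
code. Every name in it (model_srq, model_cmd_destroy_srq,
model_destroy_srq, the fw_fails flag) is hypothetical and only models
the idea: the buffer exposed for DMA is freed only after the destroy
command succeeded, and is deliberately leaked otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an SRQ and the buffer the HW DMAs into. */
struct model_srq {
	void *fragbuf;     /* memory exposed to the device for DMA */
	bool hw_destroyed; /* true once the HW confirmed the destroy */
};

/* Hypothetical FW command: 0 on success, negative value on failure. */
static int model_cmd_destroy_srq(struct model_srq *srq, bool fw_fails)
{
	if (fw_fails)
		return -5; /* stands in for an errno such as -EIO */
	srq->hw_destroyed = true;
	return 0;
}

/*
 * Free the DMA buffer only if the HW confirmed it stopped using it.
 * On failure the buffer is leaked on purpose: handing it back to the
 * allocator while DMA may still be active would corrupt memory.
 */
static int model_destroy_srq(struct model_srq *srq, bool fw_fails)
{
	int ret = model_cmd_destroy_srq(srq, fw_fails);

	if (ret)
		return ret; /* leak srq->fragbuf, it may be under DMA */

	assert(srq->hw_destroyed);
	free(srq->fragbuf);
	srq->fragbuf = NULL;
	return 0;
}
```

In this model a failed destroy leaves fragbuf allocated, and a later
successful destroy releases it.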

IMHO mlx5, and all the other drivers, get this wrong. Failing to
eventually destroy an object is a catastrophic failure of the
device. In the case of a kernel object it must always be destroyed on
the first attempt.

In this case the device should be killed. Disable memory access at the
PCI config space, trigger a device reset, disassociate the device, and
allow all destroy commands to fake-succeed.

Since drivers need help to get this right, I wonder if we should fix
this at the core level by introducing a 'your device is screwed up,
kill it' callback.

Then all the destroys can return failures as Gal wanted.

The core logic would be something like:

    ret = dev->ops.destroy_foo();
    if (ret && is_kernel_object()) {
            dev->ops.device_is_broken();
            ret = dev->ops.destroy_foo();
            WARN_ON(ret);
    }

I.e. after 'device_is_broken' the driver must always succeed future
destroys.
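
A self-contained userspace model of that flow, as a sketch only. The
ops table, model_dev, its hw_healthy/broken fields and
core_destroy_foo() are all hypothetical, and the fprintf() stands in
for WARN_ON(). It assumes device_is_broken is invoked only when the
first destroy of a kernel object fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct model_dev;

/* Hypothetical ops table; only its shape mirrors the pseudocode. */
struct model_ops {
	int (*destroy_foo)(struct model_dev *dev);
	void (*device_is_broken)(struct model_dev *dev);
};

struct model_dev {
	struct model_ops ops;
	bool hw_healthy; /* false: real destroy commands fail */
	bool broken;     /* set by device_is_broken() */
	int kill_count;  /* how many times the device was killed */
};

static int model_destroy_foo(struct model_dev *dev)
{
	if (dev->broken)
		return 0; /* device is dead, so destroy fake-succeeds */
	return dev->hw_healthy ? 0 : -5;
}

static void model_device_is_broken(struct model_dev *dev)
{
	/* A real driver would cut PCI memory access, reset, disassociate. */
	dev->broken = true;
	dev->kill_count++;
}

/* Core-level destroy: user objects may fail, kernel objects may not. */
static int core_destroy_foo(struct model_dev *dev, bool is_kernel_object)
{
	int ret;

	assert(dev->ops.destroy_foo && dev->ops.device_is_broken);
	ret = dev->ops.destroy_foo(dev);
	if (ret && is_kernel_object) {
		dev->ops.device_is_broken(dev);
		ret = dev->ops.destroy_foo(dev);
		if (ret) /* stands in for WARN_ON(ret) */
			fprintf(stderr, "destroy failed on a broken device\n");
	}
	return ret;
}

static struct model_dev model_dev_init(bool hw_healthy)
{
	struct model_dev dev = {
		.ops = {
			.destroy_foo = model_destroy_foo,
			.device_is_broken = model_device_is_broken,
		},
		.hw_healthy = hw_healthy,
	};
	return dev;
}
```

With an unhealthy device, a user-object destroy returns the error to
the caller, while a kernel-object destroy kills the device once and
then succeeds, as do all later destroys.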

Then we have a chance to make this work properly... mlx5 at least
already has an implementation of 'device_is_broken' that does trigger
success for future destroys.

Jason