From: Jason Gunthorpe <jgg@nvidia.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>,
Leon Romanovsky <leonro@mellanox.com>,
<linux-rdma@vger.kernel.org>
Subject: Re: [PATCH rdma-next v1 03/10] RDMA/mlx5: Issue FW command to destroy SRQ on reentry
Date: Wed, 2 Sep 2020 21:31:15 -0300
Message-ID: <20200903003115.GA1480685@nvidia.com>
In-Reply-To: <20200830084010.102381-4-leon@kernel.org>
On Sun, Aug 30, 2020 at 11:40:03AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
>
> The HW release can fail and leave the system in a limbo state,
> where the SRQ is removed from the table but can't be destroyed later.
> On every reentry, the initial xa_erase_irq() check will fail.
>
> Rewrite the erase logic to keep the index but not store the entry
> itself. This way, we can safely reinsert the entry in the case
> of destroy failure and be safe from any xa_store_irq() error.
>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
> drivers/infiniband/hw/mlx5/srq_cmd.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c
> index 37aaacebd3f2..c6d807f04d9d 100644
> --- a/drivers/infiniband/hw/mlx5/srq_cmd.c
> +++ b/drivers/infiniband/hw/mlx5/srq_cmd.c
> @@ -596,13 +596,22 @@ void mlx5_cmd_destroy_srq(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq)
> struct mlx5_core_srq *tmp;
> int err;
>
> - tmp = xa_erase_irq(&table->array, srq->srqn);
> - if (!tmp || tmp != srq)
> + /* Delete entry, but leave index occupied */
> + tmp = xa_store_irq(&table->array, srq->srqn, NULL, 0);
> + if (WARN_ON(!tmp || tmp != srq))
> return;

This isn't an allocating xarray:

	xa_init_flags(&table->array, XA_FLAGS_LOCK_IRQ);

So storing NULL really does delete the entry (and may free the
internal node that held it), which means the xa_store_irq() below
that puts the entry back could fail on allocation.

I think this should be written as:

	xa_cmpxchg_irq(&table->array, srq->srqn, srq, XA_ZERO_ENTRY, 0);

And the undo below would be:

	xa_cmpxchg_irq(&table->array, srq->srqn, XA_ZERO_ENTRY, srq, 0);

> + xa_erase_irq(&table->array, srq->srqn);
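
And this is racy: once the FW has destroyed the SRQ it could reallocate
the same srqn, and the new SRQ could already be stored in the xarray,
so xa_erase_irq() would remove the new entry. It needs to be
xa_release_irq(), which only clears the slot if it still holds
XA_ZERO_ENTRY; that helper looks like it needs to be added to match
xa_reserve_irq().
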
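The race above can be replayed with a similar hypothetical userspace model (toy_* names again, a fixed slot array for the xarray, TOY_ZERO_ENTRY for XA_ZERO_ENTRY, no locking). For illustration it assumes the create path stores the new SRQ with a plain store that overwrites whatever the slot holds, including the reservation. toy_erase() clears the slot unconditionally, like xa_erase_irq(); toy_release() clears it only if it still holds the reservation, which is what xa_release() does by calling xa_cmpxchg() with XA_ZERO_ENTRY as the expected old value.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: fixed slots stand in for the xarray and
 * TOY_ZERO_ENTRY for XA_ZERO_ENTRY. No locking, no allocation. */
#define TOY_NSLOTS 16
static char toy_zero_storage;
#define TOY_ZERO_ENTRY ((void *)&toy_zero_storage)

static void *toy_table[TOY_NSLOTS];

/* Models xa_cmpxchg(): replace only if the slot holds @old. */
static void *toy_cmpxchg(unsigned int index, void *old, void *entry)
{
	void *curr;

	assert(index < TOY_NSLOTS);
	curr = toy_table[index];
	if (curr == old)
		toy_table[index] = entry;
	return curr;
}

/* Like xa_erase(): clears the slot no matter what it holds. */
static void toy_erase(unsigned int index)
{
	assert(index < TOY_NSLOTS);
	toy_table[index] = NULL;
}

/* Hypothetical helper modelled on xa_release(): clear the slot only if
 * it still holds the reservation left behind by destroy. */
static void toy_release(unsigned int index)
{
	toy_cmpxchg(index, TOY_ZERO_ENTRY, NULL);
}

/* Replays destroy of @old_srq racing with create of @new_srq, which the
 * FW gave the same srqn. Returns what the slot finally holds. */
static void *toy_replay(unsigned int srqn, void *old_srq, void *new_srq,
			int use_release)
{
	toy_table[srqn] = old_srq;
	/* destroy: reserve the index in place of the old SRQ */
	toy_cmpxchg(srqn, old_srq, TOY_ZERO_ENTRY);
	/* FW destroy succeeds, FW reuses srqn, create stores the new SRQ
	 * over the reservation (assumed plain-store behaviour) */
	toy_table[srqn] = new_srq;
	/* destroy finishes */
	if (use_release)
		toy_release(srqn);
	else
		toy_erase(srqn);
	return toy_table[srqn];
}
```

toy_replay(5, &a, &b, 1) leaves &b in slot 5, while toy_replay(5, &a, &b, 0) leaves NULL, i.e. the unconditional erase has removed the SRQ that reused the srqn.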
Jason
Thread overview: 25+ messages
2020-08-30 8:40 [PATCH rdma-next v1 00/10] Restore failure of destroy commands Leon Romanovsky
2020-08-30 8:40 ` [PATCH rdma-next v1 01/10] RDMA: Restore ability to fail on PD deallocate Leon Romanovsky
2020-08-30 8:40 ` [PATCH rdma-next v1 02/10] RDMA: Restore ability to fail on AH destroy Leon Romanovsky
2020-08-30 8:40 ` [PATCH rdma-next v1 03/10] RDMA/mlx5: Issue FW command to destroy SRQ on reentry Leon Romanovsky
2020-09-03 0:31 ` Jason Gunthorpe [this message]
2020-09-03 5:08 ` Leon Romanovsky
2020-09-03 11:54 ` Jason Gunthorpe
2020-08-30 8:40 ` [PATCH rdma-next v1 04/10] RDMA/mlx5: Fix potential race between destroy and CQE poll Leon Romanovsky
2020-09-03 13:42 ` Jason Gunthorpe
2020-08-30 8:40 ` [PATCH rdma-next v1 05/10] RDMA: Restore ability to fail on SRQ destroy Leon Romanovsky
2020-09-03 0:08 ` Jason Gunthorpe
2020-09-03 5:11 ` Leon Romanovsky
2020-09-03 11:55 ` Jason Gunthorpe
2020-09-03 0:18 ` Jason Gunthorpe
2020-09-03 5:28 ` Leon Romanovsky
2020-09-03 12:22 ` Jason Gunthorpe
2020-09-03 13:12 ` Jason Gunthorpe
2020-08-30 8:40 ` [PATCH rdma-next v1 06/10] RDMA/core: Delete function indirection for alloc/free kernel CQ Leon Romanovsky
2020-09-03 0:20 ` Jason Gunthorpe
2020-09-03 5:35 ` Leon Romanovsky
2020-09-03 12:24 ` Jason Gunthorpe
2020-08-30 8:40 ` [PATCH rdma-next v1 07/10] RDMA: Allow fail of destroy CQ Leon Romanovsky
2020-08-30 8:40 ` [PATCH rdma-next v1 08/10] RDMA: Change XRCD destroy return value Leon Romanovsky
2020-08-30 8:40 ` [PATCH rdma-next v1 09/10] RDMA: Restore ability to return error for destroy WQ Leon Romanovsky
2020-08-30 8:40 ` [PATCH rdma-next v1 10/10] RDMA: Make counters destroy symmetrical Leon Romanovsky