From: Jason Gunthorpe <jgg@nvidia.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>,
	Leon Romanovsky <leonro@mellanox.com>,
	<linux-rdma@vger.kernel.org>
Subject: Re: [PATCH rdma-next v1 03/10] RDMA/mlx5: Issue FW command to destroy SRQ on reentry
Date: Wed, 2 Sep 2020 21:31:15 -0300
Message-ID: <20200903003115.GA1480685@nvidia.com>
In-Reply-To: <20200830084010.102381-4-leon@kernel.org>

On Sun, Aug 30, 2020 at 11:40:03AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> The HW release can fail and leave the system in a limbo state,
> where the SRQ is removed from the table but can't be destroyed later.
> On every reentry, the initial xa_erase_irq() check will fail.
> 
> Rewrite the erase logic to keep the index, but don't store the entry
> itself. By doing so, we can safely reinsert the entry in the case
> of destroy failure and be safe from any xa_store_irq() error.
> 
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>  drivers/infiniband/hw/mlx5/srq_cmd.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c
> index 37aaacebd3f2..c6d807f04d9d 100644
> +++ b/drivers/infiniband/hw/mlx5/srq_cmd.c
> @@ -596,13 +596,22 @@ void mlx5_cmd_destroy_srq(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq)
>  	struct mlx5_core_srq *tmp;
>  	int err;
>  
> -	tmp = xa_erase_irq(&table->array, srq->srqn);
> -	if (!tmp || tmp != srq)
> +	/* Delete entry, but leave index occupied */
> +	tmp = xa_store_irq(&table->array, srq->srqn, NULL, 0);
> +	if (WARN_ON(!tmp || tmp != srq))
>  		return;

This isn't an allocating xarray:

	xa_init_flags(&table->array, XA_FLAGS_LOCK_IRQ);

So storing NULL actually does delete the entry and clean up the memory,
which means the store below could still fail.

I think this should be written as

   xa_cmpxchg_irq(&table->array, srq->srqn, srq, XA_ZERO_ENTRY, 0);

And the undo below would be

   xa_cmpxchg_irq(&table->array, srq->srqn, XA_ZERO_ENTRY, srq, 0);
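
To make the reserve-and-undo idea concrete, here is a minimal userspace
sketch. It is not kernel code: a fixed-size pointer array stands in for
the xarray, and every name in it (toy_table, toy_cmpxchg, TOY_RESERVED,
toy_hide_srq, toy_restore_srq) is hypothetical. TOY_RESERVED only plays
the role that XA_ZERO_ENTRY would play in the real table.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TABLE_SIZE 8

/* Hypothetical stand-in for struct mlx5_core_srq; only the index matters. */
struct toy_srq {
	unsigned int srqn;
};

/* Toy replacement for the xarray: one pointer slot per index. */
struct toy_table {
	void *slot[TOY_TABLE_SIZE];
};

/* Sentinel in the role of XA_ZERO_ENTRY: index reserved, no object visible. */
static char toy_reserved_marker;
#define TOY_RESERVED ((void *)&toy_reserved_marker)

/* Store repl at idx only if the slot currently holds old; return what was there. */
static void *toy_cmpxchg(struct toy_table *t, unsigned int idx,
			 void *old, void *repl)
{
	void *curr;

	assert(idx < TOY_TABLE_SIZE);	/* the toy table has a fixed size */
	curr = t->slot[idx];
	if (curr == old)
		t->slot[idx] = repl;
	return curr;
}

/* Hide the SRQ from lookups but keep its index occupied. 0 on success. */
static int toy_hide_srq(struct toy_table *t, struct toy_srq *srq)
{
	return toy_cmpxchg(t, srq->srqn, srq, TOY_RESERVED) == srq ? 0 : -1;
}

/* Undo after a failed destroy: put the SRQ back into its reserved slot. */
static int toy_restore_srq(struct toy_table *t, struct toy_srq *srq)
{
	return toy_cmpxchg(t, srq->srqn, TOY_RESERVED, srq) == TOY_RESERVED ? 0 : -1;
}
```

In this model the slot is never emptied between the hide and the restore,
so the undo is a plain overwrite of an occupied slot, and a second hide of
the same SRQ is rejected because the slot no longer holds it.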

> +	xa_erase_irq(&table->array, srq->srqn);

And this is racy since the FW could have reallocated the same srqn and
already set it in the xarray.

It needs to be xa_release_irq(), which looks like it needs to be
added to match xa_reserve_irq().
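
A small userspace sketch of the difference, under the same kind of
assumption as any toy model: a fixed-size pointer array stands in for the
xarray, and toy_table, TOY_RESERVED, toy_erase and toy_release are
hypothetical names, not kernel APIs. toy_release models the behaviour I
would expect from a release helper (as far as I know, the non-IRQ
xa_release() is documented to do nothing once the index has been stored
to); toy_erase models an unconditional erase.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TABLE_SIZE 8

/* Toy replacement for the xarray: one pointer slot per index. */
struct toy_table {
	void *slot[TOY_TABLE_SIZE];
};

/* Sentinel in the role of XA_ZERO_ENTRY: index reserved, no object visible. */
static char toy_reserved_marker;
#define TOY_RESERVED ((void *)&toy_reserved_marker)

/* Unconditional erase: clears the slot no matter who owns it now. */
static void *toy_erase(struct toy_table *t, unsigned int idx)
{
	void *old;

	assert(idx < TOY_TABLE_SIZE);
	old = t->slot[idx];
	t->slot[idx] = NULL;
	return old;
}

/* Conditional release: clears the slot only if it is still just reserved. */
static void toy_release(struct toy_table *t, unsigned int idx)
{
	assert(idx < TOY_TABLE_SIZE);
	if (t->slot[idx] == TOY_RESERVED)
		t->slot[idx] = NULL;
}
```

If a new object has already been stored at the reused index, toy_release
leaves it alone, while toy_erase would remove an entry that belongs to
someone else.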

Jason
