From: Leon Romanovsky <leon@kernel.org>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Patrisious Haddad <phaddad@nvidia.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
	Paolo Abeni <pabeni@redhat.com>,
	Saeed Mahameed <saeedm@nvidia.com>
Subject: Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
Date: Tue, 21 Mar 2023 14:43:57 +0200
Message-ID: <20230321124357.GU36557@unreal>
In-Reply-To: <ZBmlBEldcG6rMcM1@nvidia.com>

On Tue, Mar 21, 2023 at 09:37:24AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 21, 2023 at 02:02:59PM +0200, Leon Romanovsky wrote:
> > On Tue, Mar 21, 2023 at 08:53:35AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 21, 2023 at 09:54:58AM +0200, Leon Romanovsky wrote:
> > > > On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > > > > > From: Patrisious Haddad <phaddad@nvidia.com>
> > > > > > 
> > > > > > > Previously, when destroying a DCT, if the firmware function for the
> > > > > > > destruction failed, the common resource would have been destroyed
> > > > > > > either way, since it was destroyed before the firmware object.
> > > > > > > This leads to the kernel warning "refcount_t: underflow", which
> > > > > > > indicates a possible use-after-free.
> > > > > > > The warning is triggered when we try to destroy the common resource
> > > > > > > for the second time and execute refcount_dec_and_test(&common->refcount).
> > > > > > > 
> > > > > > > So now, before destroying the common resource, we check its
> > > > > > > refcount and continue with the destruction only if it isn't zero.
> > > > > 
> > > > > This seems super sketchy
> > > > > 
> > > > > If the destruction fails why not set the refcount back to 1?
> > > > 
> > > > Because destruction will fail in destroy_rq_tracked() which is after
> > > > destroy_resource_common().
> > > > 
> > > > In the first destruction attempt, we delete the qp from the radix tree
> > > > and wait for all references to drop. In order not to undo all this logic
> > > > (setting the refcount to 1 alone is not enough), it is much safer to
> > > > simply skip destroy_resource_common() in the reentry case.
> > > 
> > > This is the bug I pointed out a long time ago: it is ordered wrong to
> > > remove restrack before destruction is assured
> > 
> > It is not restrack, but a structure internal to mlx5_core.
> > 
> > static void destroy_resource_common(struct mlx5_ib_dev *dev,
> >                                     struct mlx5_core_qp *qp)
> > {
> >         struct mlx5_qp_table *table = &dev->qp_table;
> >         unsigned long flags;
> > 
> > ....
> > 
> >         spin_lock_irqsave(&table->lock, flags);
> >         radix_tree_delete(&table->tree,
> >                           qp->qpn | (qp->common.res << MLX5_USER_INDEX_LEN));
> >         spin_unlock_irqrestore(&table->lock, flags);
> >         mlx5_core_put_rsc((struct mlx5_core_rsc_common *)qp);
> >         wait_for_completion(&qp->common.free);
> > }
> 
> Same basic issue.
> 
> "RSC"'s refcount stuff is really only for ODP to use, and the silly
> pseudo locking should really just be an rwsem, not a refcount.
> 
> Get DCT out of that particular mess and the scheme is quite simple and
> doesn't need hacky stuff.
> 
> Please make a patch to remove radix tree from this code too...

ok, I'll take a look.

Thanks
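
For readers following the thread, the ordering problem can be modelled in
a few lines of userspace C. This is only a rough sketch and not the mlx5
code: struct model_rsc, model_fw_destroy(), model_destroy_common() and the
two destroy_*() helpers are all hypothetical names made up for the
illustration, and a plain int stands in for refcount_t. It contrasts
tearing down the common resource before the firmware command with issuing
the firmware command first and touching software state only once the
firmware destruction has succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in mlx5. */
struct model_rsc {
	int refcount;   /* stands in for common->refcount */
	bool in_table;  /* stands in for the radix tree entry */
	bool fw_alive;  /* firmware object still exists */
};

/* Simulated firmware destroy command: fails while *fail_budget > 0. */
int model_fw_destroy(struct model_rsc *r, int *fail_budget)
{
	assert(r->fw_alive); /* destroying a dead object is a model bug */
	if (*fail_budget > 0) {
		(*fail_budget)--;
		return -1;
	}
	r->fw_alive = false;
	return 0;
}

/*
 * Remove the lookup entry and drop the initial reference.
 * Returns false where the kernel would warn "refcount_t: underflow".
 */
bool model_destroy_common(struct model_rsc *r)
{
	r->in_table = false;
	if (r->refcount == 0)
		return false;
	r->refcount--;
	return true;
}

/* Ordering described in the commit message: common resource first. */
int destroy_common_first(struct model_rsc *r, int *fail_budget,
			 bool *underflow)
{
	if (!model_destroy_common(r))
		*underflow = true;
	return model_fw_destroy(r, fail_budget);
}

/* Alternative: firmware first, software state only on success. */
int destroy_fw_first(struct model_rsc *r, int *fail_budget,
		     bool *underflow)
{
	int err = model_fw_destroy(r, fail_budget);

	if (err)
		return err; /* nothing was torn down, retry is safe */
	if (!model_destroy_common(r))
		*underflow = true;
	return 0;
}
```

With one simulated firmware failure, retrying destroy_common_first() drops
the reference a second time and sets the underflow flag, while retrying
destroy_fw_first() leaves the table entry and refcount untouched until the
firmware command succeeds.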
