netdev.vger.kernel.org archive mirror
From: Leon Romanovsky <leon@kernel.org>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Patrisious Haddad <phaddad@nvidia.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
	Paolo Abeni <pabeni@redhat.com>,
	Saeed Mahameed <saeedm@nvidia.com>
Subject: Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure
Date: Tue, 21 Mar 2023 14:43:57 +0200
Message-ID: <20230321124357.GU36557@unreal>
In-Reply-To: <ZBmlBEldcG6rMcM1@nvidia.com>

On Tue, Mar 21, 2023 at 09:37:24AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 21, 2023 at 02:02:59PM +0200, Leon Romanovsky wrote:
> > On Tue, Mar 21, 2023 at 08:53:35AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 21, 2023 at 09:54:58AM +0200, Leon Romanovsky wrote:
> > > > On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > > > > > From: Patrisious Haddad <phaddad@nvidia.com>
> > > > > > 
> > > > > > Previously when destroying a DCT, if the firmware function for the
> > > > > > destruction failed, the common resource would have been destroyed
> > > > > > either way, since it was destroyed before the firmware object.
> > > > > > This leads to the kernel warning "refcount_t: underflow", which indicates
> > > > > > a possible use-after-free. The warning is triggered when we try to
> > > > > > destroy the common resource for the second time and execute
> > > > > > refcount_dec_and_test(&common->refcount).
> > > > > > 
> > > > > > So now, before destroying the common resource, we check its
> > > > > > refcount and continue with the destruction only if it isn't zero.
> > > > > 
> > > > > This seems super sketchy
> > > > > 
> > > > > If the destruction fails why not set the refcount back to 1?
> > > > 
> > > > Because destruction will fail in destroy_rq_tracked() which is after
> > > > destroy_resource_common().
> > > > 
> > > > In the first destruction attempt, we delete the qp from the radix tree and
> > > > wait for all references to drop. To avoid undoing all this logic (setting
> > > > the refcount to 1 alone is not enough), it is much safer to simply skip
> > > > destroy_resource_common() in the reentry case.
> > > 
> > > This is the bug I pointed out a long time ago: it is ordered wrong to
> > > remove restrack before destruction is assured
> > 
> > It is not restrack, but a structure internal to mlx5_core.
> > 
> > > static void destroy_resource_common(struct mlx5_ib_dev *dev,
> > >                                     struct mlx5_core_qp *qp)
> > > {
> > >         struct mlx5_qp_table *table = &dev->qp_table;
> > >         unsigned long flags;
> > >
> > 
> > ....
> > 
> > >         spin_lock_irqsave(&table->lock, flags);
> > >         radix_tree_delete(&table->tree,
> > >                           qp->qpn | (qp->common.res << MLX5_USER_INDEX_LEN));
> > >         spin_unlock_irqrestore(&table->lock, flags);
> > >         mlx5_core_put_rsc((struct mlx5_core_rsc_common *)qp);
> > >         wait_for_completion(&qp->common.free);
> > > }
> 
> Same basic issue.
> 
> "RSC"'s refcount stuff is really only for ODP to use, and the silly
> pseudo locking should really just be rwsem not a refcount.
> 
> Get DCT out of that particular mess and the scheme is quite simple and
> doesn't need hacky stuff.
> 
> Please make a patch to remove radix tree from this code too...

ok, I'll take a look.

Thanks
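
The ordering concern discussed above can be illustrated with a small
user-space sketch. This is not the mlx5 code: struct obj, fw_destroy(),
untrack(), both destroy_*() helpers and underflow_after_retry() are
hypothetical names made up for illustration, a plain int stands in for
refcount_t, and the "firmware" is a stub that fails a fixed number of times.
It only shows why dropping the tracking reference before the firmware call
can underflow on a retry, while dropping it after a successful firmware call
does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tracked object; not the mlx5 structures. */
struct obj {
	int refcount;     /* 1 while the object is live */
	bool in_table;    /* models presence in the lookup table */
	int fw_failures;  /* how many times the fake firmware call fails */
};

/* Fake firmware destroy: fails while fw_failures > 0. */
int fw_destroy(struct obj *o)
{
	if (o->fw_failures > 0) {
		o->fw_failures--;
		return -1;
	}
	return 0;
}

/* Drop the tracking state; returns false where a real refcount_t
 * would warn about an underflow. */
bool untrack(struct obj *o)
{
	o->in_table = false;
	if (o->refcount == 0)
		return false;
	o->refcount--;
	return true;
}

/* Ordering A: untrack first, then call firmware. */
int destroy_untrack_first(struct obj *o, bool *underflow)
{
	if (!untrack(o))
		*underflow = true;
	return fw_destroy(o);
}

/* Ordering B: call firmware first, untrack only once it succeeded. */
int destroy_fw_first(struct obj *o, bool *underflow)
{
	int err = fw_destroy(o);

	if (err)
		return err;
	if (!untrack(o))
		*underflow = true;
	return 0;
}

/* Run one failed attempt followed by one retry and report whether the
 * refcount would have underflowed. */
bool underflow_after_retry(int (*destroy)(struct obj *, bool *))
{
	struct obj o = { .refcount = 1, .in_table = true, .fw_failures = 1 };
	bool underflow = false;

	assert(destroy != NULL);
	if (destroy(&o, &underflow) != 0)
		destroy(&o, &underflow);
	return underflow;
}
```

Under these assumptions, underflow_after_retry(destroy_untrack_first)
reports an underflow and underflow_after_retry(destroy_fw_first) does not.
The real driver also has to deal with waiting for outstanding references,
which this sketch leaves out.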

Thread overview: 10+ messages
2023-03-16 13:39 [PATCH rdma-next v1 0/3] Handle FW failures to destroy QP/RQ objects Leon Romanovsky
2023-03-16 13:39 ` [PATCH mlx5-next v1 1/3] net/mlx5: Nullify qp->dbg pointer post destruction Leon Romanovsky
2023-03-16 13:39 ` [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure Leon Romanovsky
2023-03-20 19:18   ` Jason Gunthorpe
2023-03-21  7:54     ` Leon Romanovsky
2023-03-21 11:53       ` Jason Gunthorpe
2023-03-21 12:02         ` Leon Romanovsky
2023-03-21 12:37           ` Jason Gunthorpe
2023-03-21 12:43             ` Leon Romanovsky [this message]
2023-03-16 13:39 ` [PATCH rdma-next v1 3/3] RDMA/mlx5: Return the firmware result upon destroying QP/RQ Leon Romanovsky