From: Leon Romanovsky <leon@kernel.org>
To: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Cc: "Jason Gunthorpe" <jgg@nvidia.com>,
"Jack Morgenstein" <jackm@nvidia.com>,
"Feng Liu" <feliu@nvidia.com>,
"Håkon Bugge" <haakon.bugge@oracle.com>,
linux-rdma@vger.kernel.org,
"Patrisious Haddad" <phaddad@nvidia.com>,
"Vlad Dumitrescu" <vdumitrescu@nvidia.com>
Subject: Re: [PATCH rdma-rc] RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work
Date: Thu, 22 May 2025 11:58:38 +0300
Message-ID: <20250522085838.GO7435@unreal>
In-Reply-To: <0f005949-cf9b-403f-afcb-95be492a8e49@oracle.com>
On Wed, May 21, 2025 at 11:59:22AM -0700, Sharath Srinivasan wrote:
>
> On 2025-05-21 4:36 a.m., Leon Romanovsky wrote:
> > From: Jack Morgenstein <jackm@nvidia.com>
> >
> > The cited commit fixed a crash when cma_netevent_callback was called for
> > a cma_id while work on that id from a previous call had not yet started.
> > The work item was re-initialized in the second call, which corrupted the
> > work item currently in the work queue.
> >
> > However, it left a problem when queue_work fails (because the item is
> > still pending in the work queue from a previous call). In this case,
> > cma_id_put (which is called in the work handler) is therefore not
> > called. This results in a userspace process hang (zombie process).
> >
> > Fix this by calling cma_id_put() if queue_work fails.
> >
> > Fixes: 45f5dcdd0497 ("RDMA/cma: Fix workqueue crash in cma_netevent_work_handler")
>
> IMO the above Fixes: tag should point to the commit that introduced the line:
> "queue_work(cma_wq, &current_id->id.net_work);"
>
> i.e. Fixes: 925d046e7e52 ("RDMA/core: Add a netevent notifier to cma")
>
> and not another bug fix (45f5dcdd0497) which did not introduce the problem being described in this patch (a missing cma_id_put() when queue_work() fails).
No. According to the queue_work() description and implementation, that
call can fail only if the work item is already pending. Before commit
45f5dcdd0497, cma_netevent_work was always new, so queue_work() could not
fail. This is why the return value of queue_work() is almost never checked
in the kernel.
Thanks
>
> Otherwise the fix looks good to me:
> Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
>
> Thanks,
> Sharath
>
> > Signed-off-by: Jack Morgenstein <jackm@nvidia.com>
> > Signed-off-by: Feng Liu <feliu@nvidia.com>
> > Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > drivers/infiniband/core/cma.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> > index ab31eefa916b3..274cfbd5aaba7 100644
> > --- a/drivers/infiniband/core/cma.c
> > +++ b/drivers/infiniband/core/cma.c
> > @@ -5245,7 +5245,8 @@ static int cma_netevent_callback(struct notifier_block *self,
> > neigh->ha, ETH_ALEN))
> > continue;
> > cma_id_get(current_id);
> > - queue_work(cma_wq, &current_id->id.net_work);
> > + if (!queue_work(cma_wq, &current_id->id.net_work))
> > + cma_id_put(current_id);
> > }
> > out:
> > spin_unlock_irqrestore(&id_table_lock, flags);
>
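The reference imbalance discussed in this thread can be illustrated with a
small userspace model. This is only a sketch under stated assumptions: every
model_* name below is hypothetical and is not kernel code, the "pending" flag
merely imitates how queue_work() returns false for a work item that is
already queued, and the refcount is a plain int rather than a real refcount.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct model_work {
	bool pending;	/* imitates "work item already queued" */
};

struct model_id {
	int refcount;
	struct model_work net_work;
};

static void model_id_get(struct model_id *id) { id->refcount++; }
static void model_id_put(struct model_id *id) { id->refcount--; }

/* Like queue_work(): returns false if the work item is already pending. */
static bool model_queue_work(struct model_work *work)
{
	if (work->pending)
		return false;
	work->pending = true;
	return true;
}

/* One run of the work handler: clears pending, drops the queue-time ref. */
static void model_run_work(struct model_id *id)
{
	id->net_work.pending = false;
	model_id_put(id);
}

/* One netevent for one id; 'fixed' selects the behavior from the patch. */
static void model_netevent(struct model_id *id, bool fixed)
{
	model_id_get(id);
	if (!model_queue_work(&id->net_work) && fixed)
		model_id_put(id);
}

/* Two back-to-back events, then one handler run; returns leaked refs. */
static int model_leaked_refs(bool fixed)
{
	struct model_id id = { .refcount = 1 };

	model_netevent(&id, fixed);
	model_netevent(&id, fixed);	/* second queue attempt fails */
	model_run_work(&id);
	return id.refcount - 1;
}
```

In this model, model_leaked_refs(false) returns 1 (the reference taken before
the failed queue attempt is never dropped), while model_leaked_refs(true)
returns 0.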