From: Ming Lei <ming.lei@redhat.com>
To: Nilay Shroff <nilay@linux.ibm.com>
Cc: linux-nvme@lists.infradead.org, kbusch@kernel.org, hch@lst.de,
sagi@grimberg.me, axboe@fb.com, chaitanyak@nvidia.com,
dlemoal@kernel.org, gjoyce@linux.ibm.com
Subject: Re: [PATCH 2/3] nvme-fabrics: fix kernel crash while shutting down controller
Date: Wed, 30 Oct 2024 10:20:34 +0800 [thread overview]
Message-ID: <ZyGX8jCZVXwbhI89@fedora> (raw)
In-Reply-To: <ff8baedc-3aa9-4277-8753-282f3744ae2b@linux.ibm.com>
On Tue, Oct 29, 2024 at 06:10:00PM +0530, Nilay Shroff wrote:
>
>
> On 10/29/24 12:53, Ming Lei wrote:
> > On Sun, Oct 27, 2024 at 10:32:05PM +0530, Nilay Shroff wrote:
> >> The nvme keep-alive operation, which executes at a periodic interval,
> >> could potentially sneak in while shutting down a fabric controller.
> >> This may lead to a race between the fabric controller admin queue
> >> destroy code path (invoked while shutting down the controller) and the
> >> hw/hctx queue dispatcher called from the nvme keep-alive async request
> >> queuing operation. This race can lead to the kernel crash shown below:
> >>
> >> Call Trace:
> >> autoremove_wake_function+0x0/0xbc (unreliable)
> >> __blk_mq_sched_dispatch_requests+0x114/0x24c
> >> blk_mq_sched_dispatch_requests+0x44/0x84
> >> blk_mq_run_hw_queue+0x140/0x220
> >> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
> >> process_one_work+0x200/0x4e0
> >> worker_thread+0x340/0x504
> >> kthread+0x138/0x140
> >> start_kernel_thread+0x14/0x18
> >>
> >> While shutting down the fabric controller, if an nvme keep-alive
> >> request sneaks in then it is flushed off. The nvme_keep_alive_end_io
> >> function is then invoked to handle the end of the keep-alive operation,
> >> and it decrements admin->q_usage_counter; assuming this was the
> >> last/only request in the admin queue, admin->q_usage_counter drops to
> >> zero. If that happens, the blk-mq destroy queue operation
> >> (blk_mq_destroy_queue()), which may be running simultaneously on
> >> another cpu (as this is the controller shutdown code path), makes
> >> forward progress and deletes the admin queue. From this point onward
> >> we are no longer supposed to access the admin queue resources. However,
> >> the issue here is that the nvme keep-alive thread running the hw/hctx
> >> queue dispatch operation hasn't yet finished its work, so it could
> >> still access the admin queue resources after the admin queue has
> >> already been deleted, and that causes the above crash.
> >>
> >> The above kernel crash is a regression caused by the changes
> >> implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive
> >> into nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the
> >> very beginning of the controller shutdown code path so that it cannot
> >> sneak in during the shutdown operation. However, that commit removed
> >> the keep-alive stop operation from the beginning of the controller
> >> shutdown code path, which created the possibility of keep-alive
> >> sneaking in, interfering with the shutdown operation, and causing the
> >> observed kernel crash. So to fix this crash, add back the keep-alive
> >> stop operation at the very beginning of the fabric controller shutdown
> >> code path, so that the actual controller shutdown operation only
> >> begins after it is ensured that the keep-alive operation is not
> >> in-flight and cannot be scheduled in the future.
> >>
> >> Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
> >> Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/#t
> >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> >> ---
> >> drivers/nvme/host/core.c | 5 +++++
> >> 1 file changed, 5 insertions(+)
> >>
> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >> index 5016f69e9a15..865c00ea19e3 100644
> >> --- a/drivers/nvme/host/core.c
> >> +++ b/drivers/nvme/host/core.c
> >> @@ -4648,6 +4648,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> >> {
> >> nvme_mpath_stop(ctrl);
> >> nvme_auth_stop(ctrl);
> >> + /*
> >> + * the transport driver may be terminating the admin tagset a little
> >> + * later on, so we cannot have the keep-alive work running
> >> + */
> >> + nvme_stop_keep_alive(ctrl);
> >> nvme_stop_failfast_work(ctrl);
> >> flush_work(&ctrl->async_event_work);
> >> cancel_work_sync(&ctrl->fw_act_work);
> >
> > The change looks fine.
> >
> > IMO the `nvme_stop_keep_alive` in nvme_uninit_ctrl() may be moved to
> > entry of nvme_remove_admin_tag_set(), then this one in nvme_stop_ctrl()
> > can be saved?
> >
> Yes, that should work. However, IMO stopping keep-alive at the very
> beginning of the shutdown operation makes more sense, because delaying
> it is not useful anyway once we start the controller shutdown: keep-alive
> may sneak in unnecessarily while we shut down the controller, and we
> would later have to flush it off.
>
> And yes, as you mentioned, in this case we would save one call site, but
> looking at the code there are already a few other call sites where we
> call nvme_stop_keep_alive().
If it works, I'd suggest moving nvme_stop_keep_alive() from
nvme_uninit_ctrl() into nvme_remove_admin_tag_set(), because:
1) it isn't correct to do it in nvme_uninit_ctrl();
2) stopping keep-alive in nvme_remove_admin_tag_set() covers everything;
3) most of the nvme_stop_keep_alive() calls in the failure paths can be
removed.
Thanks,
Ming