From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Michael Kelley <mhklinux@outlook.com>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Martin Belanger <Martin.Belanger@dell.com>,
Oliver O'Halloran <oohall@gmail.com>,
Daniel Wagner <dwagner@suse.de>, Keith Busch <kbusch@kernel.org>,
Lukas Wunner <lukas@wunner.de>,
David Jeffery <djeffery@redhat.com>,
Jeremy Allison <jallison@ciq.com>, Jens Axboe <axboe@fb.com>,
Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
Nathan Chancellor <nathan@kernel.org>,
Jan Kiszka <jan.kiszka@siemens.com>,
Bert Karwatzki <spasswolf@web.de>
Subject: Re: [PATCH v9 0/4] shut down devices asynchronously
Date: Fri, 18 Oct 2024 07:49:51 +0200 [thread overview]
Message-ID: <2024101809-granola-coat-9a1d@gregkh> (raw)
In-Reply-To: <SN6PR02MB41571E2DD410D09CE7494B38D4402@SN6PR02MB4157.namprd02.prod.outlook.com>
On Fri, Oct 18, 2024 at 03:26:05AM +0000, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, October 10, 2024 9:22 PM
> >
> > From: Stuart Hayes <stuart.w.hayes@gmail.com> Sent: Wednesday, October 9, 2024 10:58 AM
> > >
> > > This adds the ability for the kernel to shut down devices asynchronously.
> > >
> > > Only devices with drivers that enable it are shut down asynchronously.
> > >
> > > This can dramatically reduce system shutdown/reboot time on systems that
> > > have multiple devices that take many seconds to shut down (like certain
> > > NVMe drives). On one system tested, the shutdown time went from 11 minutes
> > > without this patch to 55 seconds with the patch.
> >
> > I've been testing this series against a 6.11.0 kernel in an Azure VM, which
> > is running as a guest on the Microsoft Hyper-V hypervisor. The VM has 16 vCPUs,
> > 128 Gbytes of memory, and two physical NVMe controllers that are mapped
> > into the VM so that it can access them directly.
> >
> [snip]
> >
> > But here's the kicker: The overall process of shutting down the devices
> > took *longer* with the patch set than without. Here's the same output
> > from a 6.11.0 kernel without the patch set:
> >
> [snip]
> >
> > Any thoughts on what might be causing this? I haven't gone into the
> > details of your algorithms for parallelizing, but is there any extra
> > overhead that could be adding to the time? Or maybe this is
> > something unique to Hyper-V guests. The overall difference is only
> > a few tens of milliseconds, so not that big of a deal. But maybe it's
> > an indicator that something unexpected is happening that we should
> > understand.
> >
> > I'll keep thinking about the issue and see if I can get any more insight.
>
> I've debugged this enough to now know what is happening. I see the
> same behavior in two different environments:
>
> 1) A Hyper-V VM with 8 vCPUs running on my local laptop. It has
> no NVMe devices, so there's no opportunity for parallelism with this
> patch set.
>
> 2) A Raspberry Pi 5 with 4 CPUs. Linux is running on the bare metal.
> It has one NVMe device via the Pi 5 M.2 HAT.
>
> In both cases, the loop in device_shutdown() goes through a lot of
> devices: 492 in my Hyper-V VM, and 577 in the Pi 5. With the code
> in this patch set, all the devices get added to the async_wq in
> kernel/async.c. So async_wq quickly gets heavily populated.
>
> In the process, the workqueue code spins up additional worker threads
> to handle the load. On the Hyper-V VM, 210 to 230 new kernel
> threads are created during device_shutdown(), depending on the
> timing. On the Pi 5, 253 are created. The max for this workqueue is
> WQ_DFL_ACTIVE (256).
>
> Creating all these new worker threads, and scheduling and
> synchronizing them, takes 30 to 40 milliseconds of additional time
> compared to the original code. On the Hyper-V VM, device_shutdown()
> completes in about 5 milliseconds with the original code, and around
> 40 milliseconds with the new async code. The Pi 5 results are more
> variable, but also show roughly 30 to 40 milliseconds of additional time.
>
> Is all this extra work a problem? I don't know. Clearly, there's a big
> benefit in parallelizing the NVMe shutdown, and in those situations
> the extra 30 to 40 milliseconds can be ignored. But I wonder if there
> are other situations where the extra elapsed time and CPU
> consumption might be a problem.
>
> I also wonder about scalability. Does everything still work if
> device_shutdown is processing 5000 devices? Is there a possibility
> of deadlock if async_wq can only have 256 active items out of
> 5000 queued ones?
In talking with someone off-list about this yesterday, we guessed that
the "thundering herd" problem might be what was happening here, and you
went and proved it was so! Thanks so much for doing this work.
I don't think we can put this type of load on all systems just to handle
one specific type of "bad" hardware that takes a long time to
shut down, sorry.
Also think of systems with tens of thousands of devices connected to
them, with tiny 31-bit processors (e.g. s390): shutting those down by
spinning up a thread for each of those devices is sure to cause us real
problems.
thanks,
greg k-h
Thread overview: 16+ messages
2024-10-09 17:57 [PATCH v9 0/4] shut down devices asynchronously Stuart Hayes
2024-10-09 17:57 ` [PATCH v9 1/4] driver core: don't always lock parent in shutdown Stuart Hayes
2024-10-09 17:57 ` [PATCH v9 2/4] driver core: separate function to shutdown one device Stuart Hayes
2024-10-09 17:57 ` [PATCH v9 3/4] driver core: shut down devices asynchronously Stuart Hayes
2024-10-09 17:57 ` [PATCH v9 4/4] nvme-pci: Make driver prefer asynchronous shutdown Stuart Hayes
2024-10-11 4:22 ` [PATCH v9 0/4] shut down devices asynchronously Michael Kelley
2024-10-11 15:52 ` Laurence Oberman
2024-10-18 3:26 ` Michael Kelley
2024-10-18 5:49 ` Greg Kroah-Hartman [this message]
2024-10-18 9:14 ` Lukas Wunner
2024-10-18 9:37 ` Greg Kroah-Hartman
2024-10-19 0:27 ` stuart hayes
2024-10-20 0:24 ` Michael Kelley
2024-10-29 15:32 ` Christoph Hellwig
2025-01-30 3:06 ` [PATCH] driver core: optimize async device shutdown Sultan Alsawaf
2025-01-30 17:12 ` [PATCH v2] " Sultan Alsawaf