From: Ming Lei <ming.lei@redhat.com>
To: Daniel Wagner <dwagner@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>, Keith Busch <kbusch@kernel.org>,
Sagi Grimberg <sagi@grimberg.me>,
Thomas Gleixner <tglx@linutronix.de>,
Christoph Hellwig <hch@lst.de>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
John Garry <john.g.garry@oracle.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
Jason Wang <jasowang@redhat.com>,
Kashyap Desai <kashyap.desai@broadcom.com>,
Sumit Saxena <sumit.saxena@broadcom.com>,
Shivasharan S <shivasharan.srikanteshwara@broadcom.com>,
Chandrakanth patil <chandrakanth.patil@broadcom.com>,
Sathya Prakash Veerichetty <sathya.prakash@broadcom.com>,
Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>,
Nilesh Javali <njavali@marvell.com>,
GR-QLogic-Storage-Upstream@marvell.com,
Jonathan Corbet <corbet@lwn.net>,
Frederic Weisbecker <frederic@kernel.org>,
Mel Gorman <mgorman@suse.de>, Hannes Reinecke <hare@suse.de>,
Sridhar Balaraman <sbalaraman@parallelwireless.com>,
"brookxu.cn" <brookxu.cn@gmail.com>,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
virtualization@lists.linux.dev, megaraidlinux.pdl@broadcom.com,
mpi3mr-linuxdrv.pdl@broadcom.com,
MPT-FusionLinux.pdl@broadcom.com, storagedev@microchip.com,
linux-doc@vger.kernel.org
Subject: Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
Date: Fri, 9 Aug 2024 22:53:16 +0800
Message-ID: <ZrYtXDrdPjn48r6k@fedora>
In-Reply-To: <856091db-431f-48f5-9daa-38c292a6bbd2@flourine.local>
On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > Isolated CPUs are removed from the queue mapping in this patchset; when someone
> > submits IOs from an isolated CPU, what is the correct hctx used for handling
> > these IOs?
>
> No, every possible CPU gets a mapping. What this patch series does is
> to limit/align the number of hardware contexts to the number of
> housekeeping CPUs. There is still a complete ctx-hctx mapping. So
OK, then I guess patches 1~7 aren't really supposed to belong to this series,
since you just want to reduce nr_hw_queues and, meanwhile, spread
housekeeping CPUs first to avoid queues with an all-isolated cpu mask.
> whenever a user thread on an isolated CPU issues an IO, a
> housekeeping CPU will also be involved (with the additional overhead,
> which seems to be okay for these users).
>
> Not having a hardware queue on the isolated CPUs ensures we really never get
> any unexpected IO on those CPUs unless userspace does it on its own.
> It's a safety net.
>
> Just to illustrate it, the non-isolcpus configuration (default) map
> for an 8-CPU setup:
>
> queue mapping for /dev/vda
> hctx0: default 0
> hctx1: default 1
> hctx2: default 2
> hctx3: default 3
> hctx4: default 4
> hctx5: default 5
> hctx6: default 6
> hctx7: default 7
>
> and with isolcpus=io_queue,2-3,6-7
>
> queue mapping for /dev/vda
> hctx0: default 0 2
> hctx1: default 1 3
> hctx2: default 4 6
> hctx3: default 5 7
OK, it looks like I missed the point in patch 15, in which you add isolated CPUs
into the mapping manually. I am just wondering: why not take the current two-stage
policy to cover both housekeeping and isolated CPUs in group_cpus_evenly()?
That is, spread housekeeping CPUs first, then isolated CPUs, just like
what we do for present & non-present CPUs.
Then the whole patchset could be simplified a lot.
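A minimal userspace sketch of that two-stage idea (purely illustrative,
with made-up names and a simple round-robin policy, not the actual
group_cpus_evenly() code): spread the housekeeping CPUs over the groups
first, then distribute the isolated CPUs on top. For the 8-CPU,
isolcpus=io_queue,2-3,6-7 example above it happens to reproduce the same
hctx layout:

/*
 * Toy model only (not kernel code): two-stage grouping, housekeeping
 * CPUs first, isolated CPUs second, so no group is left with an
 * all-isolated mask.
 */
#include <stdio.h>

#define NR_CPUS   8
#define NR_GROUPS 4

int main(void)
{
	/* assumption: isolcpus=io_queue,2-3,6-7 */
	const int isolated[NR_CPUS] = { 0, 0, 1, 1, 0, 0, 1, 1 };
	int group_of[NR_CPUS];
	int next = 0;

	/* stage 1: spread housekeeping CPUs round-robin over all groups */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (!isolated[cpu])
			group_of[cpu] = next++ % NR_GROUPS;

	/* stage 2: spread isolated CPUs over the same groups */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (isolated[cpu])
			group_of[cpu] = next++ % NR_GROUPS;

	for (int grp = 0; grp < NR_GROUPS; grp++) {
		printf("group %d:", grp);
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if (group_of[cpu] == grp)
				printf(" %d", cpu);
		printf("\n");
	}
	return 0;
}

This prints group 0: 0 2, group 1: 1 3, group 2: 4 6, group 3: 5 7,
matching the queue mapping shown for /dev/vda above.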
>
> > In the current implementation, it depends on the implied zero-filled
> > tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
> >
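(A tiny userspace illustration of that point, not the actual blk-mq
lookup code: if only the housekeeping CPUs were mapped explicitly, the
remaining mq_map[] entries keep their implied default value 0, so any
submission from an isolated CPU silently lands on hctx 0:)

#include <stdio.h>

#define NR_CPUS 8

int main(void)
{
	/*
	 * Assumption for illustration: housekeeping CPUs 0, 1, 4, 5 are
	 * mapped to hctx 0-3; isolated CPUs 2, 3, 6, 7 are left at the
	 * implied 0.
	 */
	unsigned int mq_map[NR_CPUS] = { [0] = 0, [1] = 1, [4] = 2, [5] = 3 };

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("submission on CPU %d -> hctx %u\n", cpu, mq_map[cpu]);
	return 0;
}
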
> > During CPU offline, in blk_mq_hctx_notify_offline(),
> > blk_mq_hctx_has_online_cpu() returns true even though the last CPU in
> > hctx 0 is offline, because isolated CPUs join hctx 0 unexpectedly, so IOs in
> > hctx 0 won't be drained.
> >
> > However, the managed irq core code still shuts down the hw queue's irq
> > because all CPUs in this hctx are offline now. Then an IO hang is triggered,
> > isn't it?
>
> Thanks for the explanation. I was able to reproduce this scenario, that
> is, a hardware context whose two CPUs both go offline. Initially, I used
> fio to create the workload, but this never hit the hang. Instead,
> some background workload from systemd-journald is pretty reliable at
> triggering the hang you describe.
>
> Example:
>
> hctx2: default 4 6
>
> CPU 0 stays online, CPUs 1-5 are offline. CPU 6 is offlined:
>
> smpboot: CPU 5 is now offline
> blk_mq_hctx_has_online_cpu:3537 hctx3 offline
> blk_mq_hctx_has_online_cpu:3537 hctx2 offline
>
> and there is no forward progress anymore; the CPU hotplug state machine
> is blocked and an IO is hanging:
>
> # grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
> /sys/kernel/debug/block/vda/hctx2/tags:busy=61
>
> and blk_mq_hctx_notify_offline() busy-loops forever:
>
> task:cpuhp/6 state:D stack:0 pid:439 tgid:439 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x79d/0x15c0
> ? lockdep_hardirqs_on_prepare+0x152/0x210
> ? kvm_sched_clock_read+0xd/0x20
> ? local_clock_noinstr+0x28/0xb0
> ? local_clock+0x11/0x30
> ? lock_release+0x122/0x4a0
> schedule+0x3d/0xb0
> schedule_timeout+0x88/0xf0
> ? __pfx_process_timeout+0x10/0x10d
> msleep+0x28/0x40
> blk_mq_hctx_notify_offline+0x1b5/0x200
> ? cpuhp_thread_fun+0x41/0x1f0
> cpuhp_invoke_callback+0x27e/0x780
> ? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
> ? cpuhp_thread_fun+0x42/0x1f0
> cpuhp_thread_fun+0x178/0x1f0
> smpboot_thread_fn+0x12e/0x1c0
> ? __pfx_smpboot_thread_fn+0x10/0x10
> kthread+0xe8/0x110
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x33/0x40
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> I don't think this is a new problem this code introduces. This problem
> exists for any hardware context which has more than one CPU. As far as I
> understand it, the problem is that there is no forward progress possible
> for the IO itself (I assume the corresponding resources for the CPU
When blk_mq_hctx_notify_offline() is running, the current CPU isn't
offline yet and the hctx is still active, same with the managed irq, so it is
fine to wait there until all in-flight IOs originating from this hctx have
completed.
The question is why these requests can't be completed. Forward progress
is supposed to be provided by blk-mq, and these requests were very likely
allocated & submitted from CPU 6.
Can you figure out what the effective mask for the irq of hctx2 is? It is
supposed to be CPU 6. The block debugfs for vda should also provide helpful
hints.
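One rough way to check (a sketch under the assumption that the irq number
of hctx2 has already been looked up, e.g. from /proc/interrupts, where
virtio-blk request queue interrupts usually appear as virtioN-req.M, and
that /proc/irq/<irq>/effective_affinity_list is available, i.e.
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK is set):

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[128], buf[256];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <irq>\n", argv[0]);
		return 1;
	}

	/* print the effective affinity the kernel reports for this irq */
	snprintf(path, sizeof(path), "/proc/irq/%s/effective_affinity_list",
		 argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("irq %s effective affinity: %s", argv[1], buf);
	fclose(f);
	return 0;
}
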
> going offline have already been shut down, thus no progress?) and
> blk_mq_hctx_notify_offline isn't doing anything in this scenario.
RH has an internal CPU hotplug stress test, but we have not seen such a
report so far. I will try to set up this kind of configuration and see if
it can be reproduced.
>
> Couldn't we do something like:
I usually won't think about any solution until the root cause is figured
out, :-)
Thanks,
Ming