* [PATCH v4 0/9] blk: honor isolcpus configuration
@ 2024-12-17 18:29 Daniel Wagner
  2024-12-17 18:29 ` [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups Daniel Wagner
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

The first part of v3 of this series went in [3], so here is the
rebased, retested and slightly extended second half.

During testing I ran into a situation where group_cpus_evenly returned a
masks array with fewer groups than requested; KASAN was unhappy and the
kernel also crashed. After staring at the code for a long time and also
reading Ming's comment in alloc_nodes_groups, this is something that can
happen, depending on the online/offline CPU configuration and, with this
series, the isolated CPU configuration.

The second big discussion point in v3 was how to handle the situation
where a hardware context has both a housekeeping and an isolated CPU
mapped to it and the housekeeping CPU goes offline. In this case there
is nothing handling IOs for the given context, and any IO issued by the
still-online isolated CPU will just hang.

I've experimented for a while and all solutions I came up with were
horrible hacks (the hot path needs to be touched) and I don't want to
slow down all other users (which is almost everyone). IMO, it's just not
worth trying to fix this corner case. If the user is using isolcpus and
does CPU hotplug, we can expect that the user can also offline the
isolated CPUs first. I've discussed this topic during ALPSS and the room
came to the same conclusion. Thus I just added a patch which issues a
warning that IOs are likely to hang.

Furthermore, I started working on making nvme-fabrics honor the isolcpus
configuration as well, but I decided to leave it out for now. No need to
make this series bigger again.

Daniel

ps: CCed a few more people who were involved in the isolcpus discussion
during Plumbers.

Initial cover letter:

The nvme-pci driver is ignoring the isolcpus configuration. There were
several attempts to fix this in the past [1][2]. This is another attempt
but this time trying to address the feedback and solve it in the core
code.

The first patch introduces a new option for isolcpus, 'io_queue', but I'm
not really sure whether this is needed or whether we could just use the
managed_irq option instead. I guess it depends on whether there is a use
case which depends on queues on the isolated CPUs.
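
For reference, the kind of configuration this series targets looks along
these lines (an example boot parameter; the CPU list is arbitrary and the
flag is the existing managed_irq one):

  isolcpus=managed_irq,2-3,6-7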

The second patch introduces a new block layer helper which returns the
number of possible queues. I suspect it would also make sense to make
this helper a bit smarter and consider the number of queues the hardware
supports as well.

And the last patch updates the group_cpus_evenly function so that it uses
only the housekeeping CPUs when they are defined.
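
In the current series this boils down to a check along these lines (a
sketch based on the group_cpus_evenly update later in the series):

  struct cpumask *group_cpus_evenly(unsigned int *numgrps)
  {
          /* restrict the grouping to housekeeping CPUs when managed_irq is set */
          if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
                  return group_mask_cpus_evenly(numgrps,
                                  housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));

          return group_possible_cpus_evenly(numgrps);
  }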

Note this series is not addressing the affinity setting of the admin
queue (queue 0). I'd like to address this after we have agreed on how to
solve the rest. Currently, the admin queue affinity can be controlled by
the irqaffinity command line option, so there is at least a workaround
for it.
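
For example (just an illustration, the exact CPU list depends on the
setup), booting with something like

  irqaffinity=0-1

keeps the default IRQ affinity mask, and with it the admin queue
interrupt, on the housekeeping CPUs.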

Baseline:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1536 MB
node 0 free: 1227 MB
node 1 cpus: 4 5 6 7
node 1 size: 1729 MB
node 1 free: 1422 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

options nvme write_queues=4 poll_queues=4

55: 0 41 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
63: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4-5
64: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 6-7
65: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0-1
66: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 2-3
67: 0 0 0 0 24 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
68: 0 0 0 0 0 1 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
69: 0 0 0 0 0 0 41 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 6
70: 0 0 0 0 0 0 0 3 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 7
71: 1 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 9-edge nvme0q9 affinity: 0
72: 0 18 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 10-edge nvme0q10 affinity: 1
73: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 11-edge nvme0q11 affinity: 2
74: 0 0 0 3 0 0 0 0 PCI-MSIX-0000:00:05.0 12-edge nvme0q12 affinity: 3

queue mapping for /dev/nvme0n1
        hctx0: default 4 5
        hctx1: default 6 7
        hctx2: default 0 1
        hctx3: default 2 3
        hctx4: read 4
        hctx5: read 5
        hctx6: read 6
        hctx7: read 7
        hctx8: read 0
        hctx9: read 1
        hctx10: read 2
        hctx11: read 3
        hctx12: poll 4 5
        hctx13: poll 6 7
        hctx14: poll 0 1
        hctx15: poll 2 3

PCI name is 00:05.0: nvme0n1
        irq 55, cpu list 0-3, effective list 1
        irq 63, cpu list 4-5, effective list 5
        irq 64, cpu list 6-7, effective list 7
        irq 65, cpu list 0-1, effective list 1
        irq 66, cpu list 2-3, effective list 3
        irq 67, cpu list 4, effective list 4
        irq 68, cpu list 5, effective list 5
        irq 69, cpu list 6, effective list 6
        irq 70, cpu list 7, effective list 7
        irq 71, cpu list 0, effective list 0
        irq 72, cpu list 1, effective list 1
        irq 73, cpu list 2, effective list 2
        irq 74, cpu list 3, effective list 3

Patched:

48: 0 0 33 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
58: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4
59: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 5
60: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0
61: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 1
62: 0 0 0 0 45 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
63: 0 0 0 0 0 12 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
64: 2 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 0
65: 0 35 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 1

queue mapping for /dev/nvme0n1
        hctx0: default 2 3 4 6 7
        hctx1: default 5
        hctx2: default 0
        hctx3: default 1
        hctx4: read 4
        hctx5: read 5
        hctx6: read 0
        hctx7: read 1
        hctx8: poll 4
        hctx9: poll 5
        hctx10: poll 0
        hctx11: poll 1

PCI name is 00:05.0: nvme0n1
        irq 48, cpu list 0-3, effective list 2
        irq 58, cpu list 4, effective list 4
        irq 59, cpu list 5, effective list 5
        irq 60, cpu list 0, effective list 0
        irq 61, cpu list 1, effective list 1
        irq 62, cpu list 4, effective list 4
        irq 63, cpu list 5, effective list 5
        irq 64, cpu list 0, effective list 0
        irq 65, cpu list 1, effective list 1

[1] https://lore.kernel.org/lkml/20220423054331.GA17823@lst.de/T/#m9939195a465accbf83187caf346167c4242e798d
[2] https://lore.kernel.org/linux-nvme/87fruci5nj.ffs@tglx/
[3] https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
Changes in v4:
- added "blk-mq: issue warning when offlining hctx with online isolcpus"
- fixed check in group_cpus_evenly, the if condition needs to use
  housekeeping_enabled() and not cpumask_weight(housekeeping_masks),
  because the latter will always return a valid mask.
- dropped fixed tag from "lib/group_cpus.c: honor housekeeping config when
  grouping CPUs"
- fixed overlong line "scsi: use block layer helpers to calculate num
  of queues"
- dropped "sched/isolation: Add io_queue housekeeping option",
  just document the housekeeping enum hk_type
- added "lib/group_cpus: let group_cpu_evenly return number of groups"
- collected tags
- split the series into a preparation series:
  https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Link to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de

Changes in v3:
- lifted a couple of patches from
  https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/
  "virito: add APIs for retrieving vq affinity"
  "blk-mq: introduce blk_mq_dev_map_queues"
- replaces all users of blk_mq_[pci|virtio]_map_queues with
  blk_mq_dev_map_queues
- updated/extended number of queue calc helpers
- add isolcpus=io_queue CPU-hctx mapping function
- documented enum hk_type and isolcpus=io_queue
- added "scsi: pm8001: do not overwrite PCI queue mapping"
- Link to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes in v2:
- updated documentation
- split blk/nvme-pci patch
- dropped HK_TYPE_IO_QUEUE, use HK_TYPE_MANAGED_IRQ
- Link to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de

---
Daniel Wagner (9):
      lib/group_cpus: let group_cpu_evenly return number of groups
      sched/isolation: document HK_TYPE housekeeping option
      blk-mq: add number of queue calc helper
      nvme-pci: use block layer helpers to calculate num of queues
      scsi: use block layer helpers to calculate num of queues
      virtio: blk/scsi: use block layer helpers to calculate num of queues
      lib/group_cpus: honor housekeeping config when grouping CPUs
      blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
      blk-mq: issue warning when offlining hctx with online isolcpus

 block/blk-mq-cpumap.c                     | 118 +++++++++++++++++++++++++++++-
 block/blk-mq.c                            |  43 ++++++++++-
 drivers/block/virtio_blk.c                |   5 +-
 drivers/nvme/host/pci.c                   |   5 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |  16 ++--
 drivers/scsi/qla2xxx/qla_isr.c            |  10 +--
 drivers/scsi/smartpqi/smartpqi_init.c     |   5 +-
 drivers/scsi/virtio_scsi.c                |   1 +
 drivers/virtio/virtio_vdpa.c              |   2 +-
 fs/fuse/virtio_fs.c                       |   7 +-
 include/linux/blk-mq.h                    |   2 +
 include/linux/group_cpus.h                |   2 +-
 include/linux/sched/isolation.h           |  13 ++++
 kernel/irq/affinity.c                     |   2 +-
 lib/group_cpus.c                          |  98 ++++++++++++++++++++++---
 15 files changed, 289 insertions(+), 40 deletions(-)
---
base-commit: 737371e839a368007758be329413b3f5ec9e7976
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b
prerequisite-message-id: 20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org
prerequisite-patch-id: e61d7c94facf2774a03c8c7d900f4c5c074bb7e7
prerequisite-patch-id: ed8d884af9fbc0961ad19e4c039136b6f371850f
prerequisite-patch-id: 639d56c8b9d83e516c652dea5f66ae878cd67bf8
prerequisite-patch-id: 79bc9d9cab5c55e31a8f780410a381d44d9c32eb
prerequisite-patch-id: 1fe870f3017d345011c53b51fefceaef97c29d0c
prerequisite-patch-id: 2612b868389ae9d851eb8b0230339faf17071a19
prerequisite-patch-id: c68b002174e1bfa1b2a7ac37e36ecdeec74e6d80
prerequisite-patch-id: 4d24a54316efce94595c811531ef0db9d94ff1b6

Best regards,
-- 
Daniel Wagner <wagi@kernel.org>



* [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2025-01-07  7:51   ` Hannes Reinecke
  2024-12-17 18:29 ` [PATCH v4 2/9] sched/isolation: document HK_TYPE housekeeping option Daniel Wagner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

group_cpus_evenly might allocate fewer groups than requested:

group_cpus_evenly
  __group_cpus_evenly
    alloc_nodes_groups
      # allocated total groups may be less than numgrps when
      # active total CPU number is less than numgrps

In this case, the caller will do an out-of-bounds access because it
assumes the returned masks array has numgrps entries.

Return the number of groups created so the caller can limit the access
range accordingly.
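
A minimal sketch of the updated calling convention (do_map() and
nr_queues are placeholders for the per-caller mapping logic; the real
conversions are in the diff below):

	unsigned int i, nr_masks = nr_queues;
	struct cpumask *masks = group_cpus_evenly(&nr_masks);

	if (masks) {
		for (i = 0; i < nr_queues; i++)
			/* stay within the actually allocated array */
			do_map(&masks[i % nr_masks]);
		kfree(masks);
	}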

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c        |  7 ++++---
 drivers/virtio/virtio_vdpa.c |  2 +-
 fs/fuse/virtio_fs.c          |  7 ++++---
 include/linux/group_cpus.h   |  2 +-
 kernel/irq/affinity.c        |  2 +-
 lib/group_cpus.c             | 23 +++++++++++++----------
 6 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index ad8d6a363f24ae11968b42f7bcfd6a719a0499b7..85c0a7073bd8bff5d34aad1729d45d89da4c4bd1 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -19,9 +19,10 @@
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
-	unsigned int queue, cpu;
+	unsigned int queue, cpu, nr_masks;
 
-	masks = group_cpus_evenly(qmap->nr_queues);
+	nr_masks = qmap->nr_queues;
+	masks = group_cpus_evenly(&nr_masks);
 	if (!masks) {
 		for_each_possible_cpu(cpu)
 			qmap->mq_map[cpu] = qmap->queue_offset;
@@ -29,7 +30,7 @@ void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 	}
 
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		for_each_cpu(cpu, &masks[queue])
+		for_each_cpu(cpu, &masks[queue % nr_masks])
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
 	kfree(masks);
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 1f60c9d5cb1810a6f208c24bb2ac640d537391a0..c478cccf5fd68b9c9c01332046c24316573d97cd 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -330,7 +330,7 @@ create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(&this_vecs);
 
 		if (!result) {
 			kfree(masks);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 82afe78ec542358e2db6f4d955d521652ae363ec..5acd875f1e9c9840dd9d2f3245665c91230f57a8 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -862,7 +862,7 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
 static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *fs)
 {
 	const struct cpumask *mask, *masks;
-	unsigned int q, cpu;
+	unsigned int q, cpu, nr_masks;
 
 	/* First attempt to map using existing transport layer affinities
 	 * e.g. PCIe MSI-X
@@ -882,7 +882,8 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f
 	return;
 fallback:
 	/* Attempt to map evenly in groups over the CPUs */
-	masks = group_cpus_evenly(fs->num_request_queues);
+	nr_masks = fs->num_request_queues;
+	masks = group_cpus_evenly(&nr_masks);
 	/* If even this fails we default to all CPUs use first request queue */
 	if (!masks) {
 		for_each_possible_cpu(cpu)
@@ -891,7 +892,7 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f
 	}
 
 	for (q = 0; q < fs->num_request_queues; q++) {
-		for_each_cpu(cpu, &masks[q])
+		for_each_cpu(cpu, &masks[q % nr_masks])
 			fs->mq_map[cpu] = q + VQ_REQUEST;
 	}
 	kfree(masks);
diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index e42807ec61f6e8cf3787af7daa0d8686edfef0a3..8659534a3423e92746738ac57e713b7416e05271 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -9,6 +9,6 @@
 #include <linux/kernel.h>
 #include <linux/cpu.h>
 
-struct cpumask *group_cpus_evenly(unsigned int numgrps);
+struct cpumask *group_cpus_evenly(unsigned int *numgrps);
 
 #endif
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 44a4eba80315cc098ecfa366ca1d88483641b12a..0188e133f1a508a623e33f08a0fca2e1f2cbf4e4 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -71,7 +71,7 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(&this_vecs);
 
 		if (!result) {
 			kfree(masks);
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc13907ce9f211f479615d2e3c9154..73da83ca2c45347a3a443d42d4f16801a47effd5 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -334,7 +334,8 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * @numgrps: number of groups
  *
  * Return: cpumask array if successful, NULL otherwise. And each element
- * includes CPUs assigned to this group
+ * includes CPUs assigned to this group. numgrps will be updated to the
+ * actual number of allocated masks.
  *
  * Try to put close CPUs from viewpoint of CPU and NUMA locality into
  * same group, and run two-stage grouping:
@@ -344,9 +345,9 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * We guarantee in the resulted grouping that all CPUs are covered, and
  * no same CPU is assigned to multiple groups
  */
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int *numgrps)
 {
-	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
+	unsigned int curgrp = 0, nr_present = 0, nr_others = 0, nr_grps;
 	cpumask_var_t *node_to_cpumask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
@@ -362,7 +363,8 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	if (!node_to_cpumask)
 		goto fail_npresmsk;
 
-	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	nr_grps = *numgrps;
+	masks = kcalloc(nr_grps, sizeof(*masks), GFP_KERNEL);
 	if (!masks)
 		goto fail_node_to_cpumask;
 
@@ -383,7 +385,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	cpumask_copy(npresmsk, data_race(cpu_present_mask));
 
 	/* grouping present CPUs first */
-	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+	ret = __group_cpus_evenly(curgrp, nr_grps, node_to_cpumask,
 				  npresmsk, nmsk, masks);
 	if (ret < 0)
 		goto fail_build_affinity;
@@ -395,19 +397,19 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	 * group space, assign the non present CPUs to the already
 	 * allocated out groups.
 	 */
-	if (nr_present >= numgrps)
+	if (nr_present >= nr_grps)
 		curgrp = 0;
 	else
 		curgrp = nr_present;
 	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
-	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+	ret = __group_cpus_evenly(curgrp, nr_grps, node_to_cpumask,
 				  npresmsk, nmsk, masks);
 	if (ret >= 0)
 		nr_others = ret;
 
  fail_build_affinity:
 	if (ret >= 0)
-		WARN_ON(nr_present + nr_others < numgrps);
+		WARN_ON(nr_present + nr_others < nr_grps);
 
  fail_node_to_cpumask:
 	free_node_to_cpumask(node_to_cpumask);
@@ -421,12 +423,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 		kfree(masks);
 		return NULL;
 	}
+	*numgrps = nr_present + nr_others;
 	return masks;
 }
 #else /* CONFIG_SMP */
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int *numgrps)
 {
-	struct cpumask *masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	struct cpumask *masks = kcalloc(*numgrps, sizeof(*masks), GFP_KERNEL);
 
 	if (!masks)
 		return NULL;

-- 
2.47.1



* [PATCH v4 2/9] sched/isolation: document HK_TYPE housekeeping option
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
  2024-12-17 18:29 ` [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2025-01-07 15:39   ` Waiman Long
  2024-12-17 18:29 ` [PATCH v4 3/9] blk-mq: add number of queue calc helper Daniel Wagner
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

The enum is a public API which can be used all over the kernel. This
warrants a bit of documentation.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/sched/isolation.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 2b461129d1fad0fd0ef1ad759fe44695dc635e8c..6649c3a48e0ea0a88c84bf5f2a74bff039fadaf2 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -6,6 +6,19 @@
 #include <linux/init.h>
 #include <linux/tick.h>
 
+/**
+ * enum hk_type - housekeeping cpu mask types
+ * @HK_TYPE_TIMER:	housekeeping cpu mask for timers
+ * @HK_TYPE_RCU:	housekeeping cpu mask for RCU
+ * @HK_TYPE_MISC:	housekeeping cpu mask for miscellaneous resources
+ * @HK_TYPE_SCHED:	housekeeping cpu mask for scheduling
+ * @HK_TYPE_TICK:	housekeeping cpu mask for timer tick
+ * @HK_TYPE_DOMAIN:	housekeeping cpu mask for general SMP balancing
+ *			and scheduling algorithms
+ * @HK_TYPE_WQ:		housekeeping cpu mask for workqueues
+ * @HK_TYPE_MANAGED_IRQ: housekeeping cpu mask for managed IRQs
+ * @HK_TYPE_KTHREAD:	housekeeping cpu mask for kthreads
+ */
 enum hk_type {
 	HK_TYPE_TIMER,
 	HK_TYPE_RCU,

-- 
2.47.1



* [PATCH v4 3/9] blk-mq: add number of queue calc helper
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
  2024-12-17 18:29 ` [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups Daniel Wagner
  2024-12-17 18:29 ` [PATCH v4 2/9] sched/isolation: document HK_TYPE housekeeping option Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2025-01-08  7:04   ` Hannes Reinecke
  2024-12-17 18:29 ` [PATCH v4 4/9] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Add two variants of helpers which calculate the correct number of queues
to use. Two variants are necessary because some drivers base their
maximum number of queues on the possible CPU mask, while others base it
on the online CPU mask.
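
A sketch of the intended call sites (hw_max_queues and max_msix_vectors
are made-up names for a driver's hardware limits; the actual conversions
follow in later patches):

	/* size the queue count from the possible CPUs */
	nr_queues = blk_mq_num_possible_queues(hw_max_queues);

	/* size the MSI-X vector count from the online CPUs */
	nr_vectors = blk_mq_num_online_queues(max_msix_vectors);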

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 85c0a7073bd8bff5d34aad1729d45d89da4c4bd1..b3a863c2db3231624685ab54a1810b22af4111f4 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -12,10 +12,55 @@
 #include <linux/cpu.h>
 #include <linux/group_cpus.h>
 #include <linux/device/bus.h>
+#include <linux/sched/isolation.h>
 
 #include "blk.h"
 #include "blk-mq.h"
 
+static unsigned int blk_mq_num_queues(const struct cpumask *mask,
+				      unsigned int max_queues)
+{
+	unsigned int num;
+
+	if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+		mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
+
+	num = cpumask_weight(mask);
+	return min_not_zero(num, max_queues);
+}
+
+/**
+ * blk_mq_num_possible_queues - Calc nr of queues for multiqueue devices
+ * @max_queues:	The maximal number of queues the hardware/driver
+ *		supports. If max_queues is 0, the argument is
+ *		ignored.
+ *
+ * Calculate the number of queues which should be used for a multiqueue
+ * device based on the number of possible CPUs. The helper takes the
+ * isolcpus settings into account.
+ */
+unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
+{
+	return blk_mq_num_queues(cpu_possible_mask, max_queues);
+}
+EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
+
+/**
+ * blk_mq_num_online_queues - Calc nr of queues for multiqueue devices
+ * @max_queues:	The maximal number of queues the hardware/driver
+ *		supports. If max_queues is 0, the argument is
+ *		ignored.
+ *
+ * Calculate the number of queues which should be used for a multiqueue
+ * device based on the number of online CPUs. The helper takes the
+ * isolcpus settings into account.
+ */
+unsigned int blk_mq_num_online_queues(unsigned int max_queues)
+{
+	return blk_mq_num_queues(cpu_online_mask, max_queues);
+}
+EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 769eab6247d4921e574e0828ab41a580a5a9f2fe..4f0f2ea64de2057750e88c2a3ff7d49e13a7bfc5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -920,6 +920,8 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 void blk_mq_unfreeze_queue_non_owner(struct request_queue *q);
 void blk_freeze_queue_start_non_owner(struct request_queue *q);
 
+unsigned int blk_mq_num_possible_queues(unsigned int max_queues);
+unsigned int blk_mq_num_online_queues(unsigned int max_queues);
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 			  struct device *dev, unsigned int offset);

-- 
2.47.1



* [PATCH v4 4/9] nvme-pci: use block layer helpers to calculate num of queues
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (2 preceding siblings ...)
  2024-12-17 18:29 ` [PATCH v4 3/9] blk-mq: add number of queue calc helper Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2025-01-08  7:19   ` Hannes Reinecke
  2024-12-17 18:29 ` [PATCH v4 5/9] scsi: " Daniel Wagner
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Use the helpers which calculate the correct number of queues to use when
isolcpus is enabled.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 709328a67f915aede5c6bae956e1bdd5e6f3f1bc..4af22f09ed8474676edd118477344ed32236c497 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -81,7 +81,7 @@ static int io_queue_count_set(const char *val, const struct kernel_param *kp)
 	int ret;
 
 	ret = kstrtouint(val, 10, &n);
-	if (ret != 0 || n > num_possible_cpus())
+	if (ret != 0 || n > blk_mq_num_possible_queues(0))
 		return -EINVAL;
 	return param_set_uint(val, kp);
 }
@@ -2439,7 +2439,8 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return blk_mq_num_possible_queues(0) + dev->nr_write_queues +
+		dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)

-- 
2.47.1



* [PATCH v4 5/9] scsi: use block layer helpers to calculate num of queues
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (3 preceding siblings ...)
  2024-12-17 18:29 ` [PATCH v4 4/9] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2024-12-17 18:29 ` [PATCH v4 6/9] virtio: blk/scsi: " Daniel Wagner
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Use the helpers which calculate the correct number of queues to use when
isolcpus is enabled.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 15 +++++++++------
 drivers/scsi/qla2xxx/qla_isr.c            | 10 +++++-----
 drivers/scsi/smartpqi/smartpqi_init.c     |  5 ++---
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 49abd7dd75a7b7c1ddcfac41acecbbcf7de8f5a4..59d385e5a917979ae2f61f5db2c3355b9cab7e08 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5962,7 +5962,8 @@ megasas_alloc_irq_vectors(struct megasas_instance *instance)
 		else
 			instance->iopoll_q_count = 0;
 
-		num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+		num_msix_req = blk_mq_num_online_queues(0) +
+			instance->low_latency_index_start;
 		instance->msix_vectors = min(num_msix_req,
 				instance->msix_vectors);
 
@@ -5978,7 +5979,8 @@ megasas_alloc_irq_vectors(struct megasas_instance *instance)
 		/* Disable Balanced IOPS mode and try realloc vectors */
 		instance->perf_mode = MR_LATENCY_PERF_MODE;
 		instance->low_latency_index_start = 1;
-		num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+		num_msix_req = blk_mq_num_online_queues(0) +
+			instance->low_latency_index_start;
 
 		instance->msix_vectors = min(num_msix_req,
 				instance->msix_vectors);
@@ -6234,7 +6236,7 @@ static int megasas_init_fw(struct megasas_instance *instance)
 		intr_coalescing = (scratch_pad_1 & MR_INTR_COALESCING_SUPPORT_OFFSET) ?
 								true : false;
 		if (intr_coalescing &&
-			(num_online_cpus() >= MR_HIGH_IOPS_QUEUE_COUNT) &&
+			(blk_mq_num_online_queues(0) >= MR_HIGH_IOPS_QUEUE_COUNT) &&
 			(instance->msix_vectors == MEGASAS_MAX_MSIX_QUEUES))
 			instance->perf_mode = MR_BALANCED_PERF_MODE;
 		else
@@ -6278,7 +6280,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
 		else
 			instance->low_latency_index_start = 1;
 
-		num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+		num_msix_req = blk_mq_num_online_queues(0) +
+			instance->low_latency_index_start;
 
 		instance->msix_vectors = min(num_msix_req,
 				instance->msix_vectors);
@@ -6310,8 +6313,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
 	megasas_setup_reply_map(instance);
 
 	dev_info(&instance->pdev->dev,
-		"current msix/online cpus\t: (%d/%d)\n",
-		instance->msix_vectors, (unsigned int)num_online_cpus());
+		"current msix/max num queues\t: (%d/%u)\n",
+		instance->msix_vectors, blk_mq_num_online_queues(0));
 	dev_info(&instance->pdev->dev,
 		"RDPQ mode\t: (%s)\n", instance->is_rdpq ? "enabled" : "disabled");
 
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index fe98c76e9be32ff03a1960f366f0d700d1168383..c4c6b5c6658c0734f7ff68bcc31b33dde87296dd 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -4533,13 +4533,13 @@ qla24xx_enable_msix(struct qla_hw_data *ha, struct rsp_que *rsp)
 	if (USER_CTRL_IRQ(ha) || !ha->mqiobase) {
 		/* user wants to control IRQ setting for target mode */
 		ret = pci_alloc_irq_vectors(ha->pdev, min_vecs,
-		    min((u16)ha->msix_count, (u16)(num_online_cpus() + min_vecs)),
-		    PCI_IRQ_MSIX);
+			blk_mq_num_online_queues(ha->msix_count) + min_vecs,
+			PCI_IRQ_MSIX);
 	} else
 		ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs,
-		    min((u16)ha->msix_count, (u16)(num_online_cpus() + min_vecs)),
-		    PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
-		    &desc);
+			blk_mq_num_online_queues(ha->msix_count) + min_vecs,
+			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+			&desc);
 
 	if (ret < 0) {
 		ql_log(ql_log_fatal, vha, 0x00c7,
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 04fb24d77e9b5c0137f26bc41f17191cc4c49728..7636c8d1c9f14a0d887c1d517c3664f0d0df7e6e 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -5278,15 +5278,14 @@ static void pqi_calculate_queue_resources(struct pqi_ctrl_info *ctrl_info)
 	if (reset_devices) {
 		num_queue_groups = 1;
 	} else {
-		int num_cpus;
 		int max_queue_groups;
 
 		max_queue_groups = min(ctrl_info->max_inbound_queues / 2,
 			ctrl_info->max_outbound_queues - 1);
 		max_queue_groups = min(max_queue_groups, PQI_MAX_QUEUE_GROUPS);
 
-		num_cpus = num_online_cpus();
-		num_queue_groups = min(num_cpus, ctrl_info->max_msix_vectors);
+		num_queue_groups =
+			blk_mq_num_online_queues(ctrl_info->max_msix_vectors);
 		num_queue_groups = min(num_queue_groups, max_queue_groups);
 	}
 

-- 
2.47.1



* [PATCH v4 6/9] virtio: blk/scsi: use block layer helpers to calculate num of queues
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (4 preceding siblings ...)
  2024-12-17 18:29 ` [PATCH v4 5/9] scsi: " Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2024-12-19  6:25   ` Christoph Hellwig
  2024-12-19  8:31   ` Michael S. Tsirkin
  2024-12-17 18:29 ` [PATCH v4 7/9] lib/group_cpus: honor housekeeping config when grouping CPUs Daniel Wagner
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Use the helpers which calculate the correct number of queues to use when
isolcpus is enabled.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/block/virtio_blk.c                | 5 ++---
 drivers/scsi/megaraid/megaraid_sas_base.c | 3 ++-
 drivers/scsi/virtio_scsi.c                | 1 +
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index ed514ff46dc82acd629ae594cb0fa097bd301a9b..0287ceaaf19972f3a18e81cd2e3252e4d539ba93 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -976,9 +976,8 @@ static int init_vq(struct virtio_blk *vblk)
 		return -EINVAL;
 	}
 
-	num_vqs = min_t(unsigned int,
-			min_not_zero(num_request_queues, nr_cpu_ids),
-			num_vqs);
+	num_vqs = blk_mq_num_possible_queues(
+			min_not_zero(num_request_queues, num_vqs));
 
 	num_poll_vqs = min_t(unsigned int, poll_queues, num_vqs - 1);
 
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 59d385e5a917979ae2f61f5db2c3355b9cab7e08..3ff0978b3acb5baf757fee25d9fccf4971976272 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -6236,7 +6236,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
 		intr_coalescing = (scratch_pad_1 & MR_INTR_COALESCING_SUPPORT_OFFSET) ?
 								true : false;
 		if (intr_coalescing &&
-			(blk_mq_num_online_queues(0) >= MR_HIGH_IOPS_QUEUE_COUNT) &&
+			(blk_mq_num_online_queues(0) >=
+			 MR_HIGH_IOPS_QUEUE_COUNT) &&
 			(instance->msix_vectors == MEGASAS_MAX_MSIX_QUEUES))
 			instance->perf_mode = MR_BALANCED_PERF_MODE;
 		else
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 60be1a0c61836ba643adcf9ad8d5b68563a86cb1..46ca0b82f57ce2211c7e2817dd40ee34e65bcbf9 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -919,6 +919,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	/* We need to know how many queues before we allocate. */
 	num_queues = virtscsi_config_get(vdev, num_queues) ? : 1;
 	num_queues = min_t(unsigned int, nr_cpu_ids, num_queues);
+	num_queues = blk_mq_num_possible_queues(num_queues);
 
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
 

-- 
2.47.1



* [PATCH v4 7/9] lib/group_cpus: honor housekeeping config when grouping CPUs
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (5 preceding siblings ...)
  2024-12-17 18:29 ` [PATCH v4 6/9] virtio: blk/scsi: " Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2024-12-17 18:29 ` [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
  2024-12-17 18:29 ` [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
  8 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

group_cpus_evenly distributes all present CPUs into groups. This ignores
the isolcpus configuration and assigns isolated CPUs to the groups as
well.

Make group_cpus_evenly aware of the isolcpus configuration and use the
housekeeping CPU mask as the base for distributing the available CPUs
into groups.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 lib/group_cpus.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 75 insertions(+), 2 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 73da83ca2c45347a3a443d42d4f16801a47effd5..927e4ed634d0d9ca14235c977fc53d6f5f649396 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -8,6 +8,7 @@
 #include <linux/cpu.h>
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 #ifdef CONFIG_SMP
 
@@ -330,7 +331,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 }
 
 /**
- * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * group_possible_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
  * @numgrps: number of groups
  *
  * Return: cpumask array if successful, NULL otherwise. And each element
@@ -345,7 +346,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * We guarantee in the resulted grouping that all CPUs are covered, and
  * no same CPU is assigned to multiple groups
  */
-struct cpumask *group_cpus_evenly(unsigned int *numgrps)
+static struct cpumask *group_possible_cpus_evenly(unsigned int *numgrps)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0, nr_grps;
 	cpumask_var_t *node_to_cpumask;
@@ -426,6 +427,78 @@ struct cpumask *group_cpus_evenly(unsigned int *numgrps)
 	*numgrps = nr_present + nr_others;
 	return masks;
 }
+
+/**
+ * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ * @cpu_mask: CPUs to consider for the grouping
+ *
+ * Return: cpumask array if successful, NULL otherwise. And each element
+ * includes CPUs assigned to this group.
+ *
+ * Try to put close CPUs from viewpoint of CPU and NUMA locality into
+ * same group. Allocate present CPUs on these groups evenly.
+ */
+static struct cpumask *group_mask_cpus_evenly(unsigned int *numgrps,
+					      const struct cpumask *cpu_mask)
+{
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk;
+	unsigned int nr_grps;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_nmsk;
+
+	nr_grps = *numgrps;
+	masks = kcalloc(nr_grps, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	build_node_to_cpumask(node_to_cpumask);
+
+	ret = __group_cpus_evenly(0, nr_grps, node_to_cpumask, cpu_mask, nmsk,
+				  masks);
+
+fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	*numgrps = ret;
+	return masks;
+}
+
+/**
+ * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ *
+ * Return: cpumask array if successful, NULL otherwise.
+ *
+ * group_possible_cpus_evenly() is used to distribute the groups over all
+ * possible CPUs in the absence of the isolcpus command line argument.
+ * group_mask_cpus_evenly() is used when the isolcpus command line
+ * argument is used with the managed_irq option. In this case only the
+ * housekeeping CPUs are considered.
+ */
+struct cpumask *group_cpus_evenly(unsigned int *numgrps)
+{
+	if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
+		return group_mask_cpus_evenly(numgrps,
+				housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+	}
+
+	return group_possible_cpus_evenly(numgrps);
+}
 #else /* CONFIG_SMP */
 struct cpumask *group_cpus_evenly(unsigned int *numgrps)
 {

-- 
2.47.1



* [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (6 preceding siblings ...)
  2024-12-17 18:29 ` [PATCH v4 7/9] lib/group_cpus: honor housekeeping config when grouping CPUs Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2024-12-19  6:26   ` Christoph Hellwig
  2024-12-19  9:20   ` Ming Lei
  2024-12-17 18:29 ` [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
  8 siblings, 2 replies; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

When isolcpus=managed_irq is enabled all hardware queues should run on
the housekeeping CPUs only. Thus ignore the affinity mask provided by
the driver. Also we can't use blk_mq_map_queues because it will map all
CPUs to the first hctx unless the CPU is the same one the hctx has its
affinity set to, e.g. with 8 CPUs and an isolcpus=managed_irq,2-3,6-7
config:

  queue mapping for /dev/nvme0n1
        hctx0: default 2 3 4 6 7
        hctx1: default 5
        hctx2: default 0
        hctx3: default 1

  PCI name is 00:05.0: nvme0n1
        irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
        irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
        irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
        irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
        irq 61 affinity 1 effective 1 is_managed:1 nvme0q4

whereas with blk_mq_map_hk_queues we get

  queue mapping for /dev/nvme0n1
        hctx0: default 2 4
        hctx1: default 3 5
        hctx2: default 0 6
        hctx3: default 1 7

  PCI name is 00:05.0: nvme0n1
        irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
        irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
        irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
        irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
        irq 64 affinity 1 effective 1 is_managed:1 nvme0q4

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index b3a863c2db3231624685ab54a1810b22af4111f4..38016bf1be8af14ef368e68d3fd12416858e3da6 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -61,11 +61,74 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
 
+/*
+ * blk_mq_map_hk_queues - Create housekeeping CPU to hardware queue mapping
+ * @qmap:	CPU to hardware queue map
+ *
+ * Create a housekeeping CPU to hardware queue mapping in @qmap. If the
+ * isolcpus feature is enabled and blk_mq_map_hk_queues returns true,
+ * @qmap contains a valid configuration honoring the managed_irq
+ * configuration. If the isolcpus feature is disabled this function
+ * returns false.
+ */
+static bool blk_mq_map_hk_queues(struct blk_mq_queue_map *qmap)
+{
+	struct cpumask *hk_masks;
+	cpumask_var_t isol_mask;
+	unsigned int queue, cpu, nr_masks;
+
+	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+		return false;
+
+	/* map housekeeping cpus to matching hardware context */
+	nr_masks = qmap->nr_queues;
+	hk_masks = group_cpus_evenly(&nr_masks);
+	if (!hk_masks)
+		goto fallback;
+
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
+		for_each_cpu(cpu, &hk_masks[queue % nr_masks])
+			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+	}
+
+	kfree(hk_masks);
+
+	/* map isolcpus to hardware context */
+	if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
+		goto fallback;
+
+	queue = 0;
+	cpumask_andnot(isol_mask,
+		       cpu_possible_mask,
+		       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+
+	for_each_cpu(cpu, isol_mask) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = (queue + 1) % qmap->nr_queues;
+	}
+
+	free_cpumask_var(isol_mask);
+
+	return true;
+
+fallback:
+	/* map all cpus to hardware context ignoring any affinity */
+	queue = 0;
+	for_each_possible_cpu(cpu) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = (queue + 1) % qmap->nr_queues;
+	}
+	return true;
+}
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
 	unsigned int queue, cpu, nr_masks;
 
+	if (blk_mq_map_hk_queues(qmap))
+		return;
+
 	nr_masks = qmap->nr_queues;
 	masks = group_cpus_evenly(&nr_masks);
 	if (!masks) {
@@ -121,6 +184,9 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 	if (!dev->bus->irq_get_affinity)
 		goto fallback;
 
+	if (blk_mq_map_hk_queues(qmap))
+		return;
+
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
 		mask = dev->bus->irq_get_affinity(dev, queue + offset);
 		if (!mask)

-- 
2.47.1



* [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus
  2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (7 preceding siblings ...)
  2024-12-17 18:29 ` [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
@ 2024-12-17 18:29 ` Daniel Wagner
  2024-12-19  6:28   ` Christoph Hellwig
  2024-12-20  9:04   ` Ming Lei
  8 siblings, 2 replies; 29+ messages in thread
From: Daniel Wagner @ 2024-12-17 18:29 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, Daniel Wagner

When offlining a hardware context which also has isolated CPUs mapped to
it, any IO issued by those isolated CPUs will stall as there is nothing
left which handles the interrupts etc.

This configuration/setup is not supported at this point, thus just issue
a warning.
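
The warning would look along these lines (device name and CPU number are
made up for illustration):

  nvme0n1: offlining hctx2 but there is still an online isolcpu CPU 6 mapped to it, IO stalls expected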

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index de15c0c76f874a2a863b05a23e0f3dba20cb6488..f9af0f5dd6aac8da855777acf2ffc61128f15a74 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3619,6 +3619,45 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	return data.has_rq;
 }
 
+static void blk_mq_hctx_check_isolcpus_online(struct blk_mq_hw_ctx *hctx, unsigned int cpu)
+{
+	const struct cpumask *hk_mask;
+	int i;
+
+	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+		return;
+
+	hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
+
+	for (i = 0; i < hctx->nr_ctx; i++) {
+		struct blk_mq_ctx *ctx = hctx->ctxs[i];
+
+		if (ctx->cpu == cpu)
+			continue;
+
+		/*
+		 * Check if this context has at least one online
+		 * housekeeping CPU; in this case the hardware context is
+		 * usable.
+		 */
+		if (cpumask_test_cpu(ctx->cpu, hk_mask) &&
+		    cpu_online(ctx->cpu))
+			break;
+
+		/*
+		 * The context doesn't have any online housekeeping CPUs
+		 * but there might be an online isolated CPU mapped to
+		 * it.
+		 */
+		if (cpu_is_offline(ctx->cpu))
+			continue;
+
+		pr_warn("%s: offlining hctx%d but there is still an online isolcpu CPU %d mapped to it, IO stalls expected\n",
+			hctx->queue->disk->disk_name,
+			hctx->queue_num, ctx->cpu);
+	}
+}
+
 static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int this_cpu)
 {
@@ -3638,8 +3677,10 @@ static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 			continue;
 
 		/* this hctx has at least one online CPU */
-		if (this_cpu != cpu)
+		if (this_cpu != cpu) {
+			blk_mq_hctx_check_isolcpus_online(hctx, this_cpu);
 			return true;
+		}
 	}
 
 	return false;

-- 
2.47.1



* Re: [PATCH v4 6/9] virtio: blk/scsi: use block layer helpers to calculate num of queues
  2024-12-17 18:29 ` [PATCH v4 6/9] virtio: blk/scsi: " Daniel Wagner
@ 2024-12-19  6:25   ` Christoph Hellwig
  2024-12-19  8:31   ` Michael S. Tsirkin
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-19  6:25 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Michal Koutný, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2024-12-17 18:29 ` [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
@ 2024-12-19  6:26   ` Christoph Hellwig
  2024-12-19  9:20   ` Ming Lei
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-19  6:26 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Michal Koutný, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus
  2024-12-17 18:29 ` [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
@ 2024-12-19  6:28   ` Christoph Hellwig
  2024-12-20  9:04   ` Ming Lei
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-19  6:28 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Michal Koutný, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization

On Tue, Dec 17, 2024 at 07:29:43PM +0100, Daniel Wagner wrote:
> When offlining a hardware context which also serves isolcpus mapped
> to it, any IO issued by the isolcpus will stall as there is nothing
> which handles the interrupts etc.
> 
> This configuration/setup is not supported at this point thus just issue
> a warning.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>  block/blk-mq.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index de15c0c76f874a2a863b05a23e0f3dba20cb6488..f9af0f5dd6aac8da855777acf2ffc61128f15a74 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3619,6 +3619,45 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>  	return data.has_rq;
>  }
>  
> +static void blk_mq_hctx_check_isolcpus_online(struct blk_mq_hw_ctx *hctx, unsigned int cpu)

Please avoid the overly long line here.

> +{
> +	const struct cpumask *hk_mask;
> +	int i;
> +
> +	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
> +		return;
> +
> +	hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> +
> +	for (i = 0; i < hctx->nr_ctx; i++) {
> +		struct blk_mq_ctx *ctx = hctx->ctxs[i];
> +
> +		if (ctx->cpu == cpu)
> +			continue;
> +
> +		/*
> +		 * Check if this context has at least one online
> +		 * housekeeping CPU, in which case the hardware context is
> +		 * usable.

But here you're not even using up all 80 characters for the comment.



* Re: [PATCH v4 6/9] virtio: blk/scsi: use block layer helpers to calculate num of queues
  2024-12-17 18:29 ` [PATCH v4 6/9] virtio: blk/scsi: " Daniel Wagner
  2024-12-19  6:25   ` Christoph Hellwig
@ 2024-12-19  8:31   ` Michael S. Tsirkin
  1 sibling, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2024-12-19  8:31 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Pérez, Xuan Zhuo, Andrew Morton, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization

On Tue, Dec 17, 2024 at 07:29:40PM +0100, Daniel Wagner wrote:
> Multiqueue devices should only allocate queues for the housekeeping CPUs
> when isolcpus=managed_irq is set. This avoids disturbing the isolated CPUs
> with OS workload.
> 
> Use helpers which calculate the correct number of queues to use when
> isolcpus is enabled.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>

The subject is misleading; one thinks it is only virtio blk.
It's best to just split each driver into a patch of its own.
For the changes in virtio:

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/block/virtio_blk.c                | 5 ++---
>  drivers/scsi/megaraid/megaraid_sas_base.c | 3 ++-
>  drivers/scsi/virtio_scsi.c                | 1 +
>  3 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index ed514ff46dc82acd629ae594cb0fa097bd301a9b..0287ceaaf19972f3a18e81cd2e3252e4d539ba93 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -976,9 +976,8 @@ static int init_vq(struct virtio_blk *vblk)
>  		return -EINVAL;
>  	}
>  
> -	num_vqs = min_t(unsigned int,
> -			min_not_zero(num_request_queues, nr_cpu_ids),
> -			num_vqs);
> +	num_vqs = blk_mq_num_possible_queues(
> +			min_not_zero(num_request_queues, num_vqs));
>  
>  	num_poll_vqs = min_t(unsigned int, poll_queues, num_vqs - 1);
>  
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 59d385e5a917979ae2f61f5db2c3355b9cab7e08..3ff0978b3acb5baf757fee25d9fccf4971976272 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -6236,7 +6236,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
>  		intr_coalescing = (scratch_pad_1 & MR_INTR_COALESCING_SUPPORT_OFFSET) ?
>  								true : false;
>  		if (intr_coalescing &&
> -			(blk_mq_num_online_queues(0) >= MR_HIGH_IOPS_QUEUE_COUNT) &&
> +			(blk_mq_num_online_queues(0) >=
> +			 MR_HIGH_IOPS_QUEUE_COUNT) &&
>  			(instance->msix_vectors == MEGASAS_MAX_MSIX_QUEUES))
>  			instance->perf_mode = MR_BALANCED_PERF_MODE;
>  		else
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 60be1a0c61836ba643adcf9ad8d5b68563a86cb1..46ca0b82f57ce2211c7e2817dd40ee34e65bcbf9 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -919,6 +919,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
>  	/* We need to know how many queues before we allocate. */
>  	num_queues = virtscsi_config_get(vdev, num_queues) ? : 1;
>  	num_queues = min_t(unsigned int, nr_cpu_ids, num_queues);
> +	num_queues = blk_mq_num_possible_queues(num_queues);
>  
>  	num_targets = virtscsi_config_get(vdev, max_target) + 1;
>  
> 
> -- 
> 2.47.1



* Re: [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2024-12-17 18:29 ` [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
  2024-12-19  6:26   ` Christoph Hellwig
@ 2024-12-19  9:20   ` Ming Lei
  2024-12-19 15:38     ` Daniel Wagner
  1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2024-12-19  9:20 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization

On Tue, Dec 17, 2024 at 07:29:42PM +0100, Daniel Wagner wrote:
> When isolcpus=managed_irq is enabled, all hardware queues should run on
> the housekeeping CPUs only. Thus ignore the affinity mask provided by
> the driver. Also we can't use blk_mq_map_queues because it will map all
> CPUs to the first hctx unless the CPU is the same as the one the hctx
> has its affinity set to, e.g. for 8 CPUs with an
> isolcpus=managed_irq,2-3,6-7 config:
> 
>   queue mapping for /dev/nvme0n1
>         hctx0: default 2 3 4 6 7
>         hctx1: default 5
>         hctx2: default 0
>         hctx3: default 1
> 
>   PCI name is 00:05.0: nvme0n1
>         irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
>         irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
>         irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
>         irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
>         irq 61 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> whereas with blk_mq_map_hk_queues we get
> 
>   queue mapping for /dev/nvme0n1
>         hctx0: default 2 4
>         hctx1: default 3 5
>         hctx2: default 0 6
>         hctx3: default 1 7
> 
>   PCI name is 00:05.0: nvme0n1
>         irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
>         irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
>         irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
>         irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
>         irq 64 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>  block/blk-mq-cpumap.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index b3a863c2db3231624685ab54a1810b22af4111f4..38016bf1be8af14ef368e68d3fd12416858e3da6 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -61,11 +61,74 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
>  
> +/*
> + * blk_mq_map_hk_queues - Create housekeeping CPU to hardware queue mapping
> + * @qmap:	CPU to hardware queue map
> + *
> + * Create a housekeeping CPU to hardware queue mapping in @qmap. If the
> + * isolcpus feature is enabled and blk_mq_map_hk_queues returns true,
> + * @qmap contains a valid configuration honoring the managed_irq
> + * configuration. If the isolcpus feature is disabled this function
> + * returns false.
> + */
> +static bool blk_mq_map_hk_queues(struct blk_mq_queue_map *qmap)
> +{
> +	struct cpumask *hk_masks;
> +	cpumask_var_t isol_mask;
> +	unsigned int queue, cpu, nr_masks;
> +
> +	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
> +		return false;
> +
> +	/* map housekeeping cpus to matching hardware context */
> +	nr_masks = qmap->nr_queues;
> +	hk_masks = group_cpus_evenly(&nr_masks);
> +	if (!hk_masks)
> +		goto fallback;
> +
> +	for (queue = 0; queue < qmap->nr_queues; queue++) {
> +		for_each_cpu(cpu, &hk_masks[queue % nr_masks])
> +			qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +	}
> +
> +	kfree(hk_masks);
> +
> +	/* map isolcpus to hardware context */
> +	if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
> +		goto fallback;
> +
> +	queue = 0;
> +	cpumask_andnot(isol_mask,
> +		       cpu_possible_mask,
> +		       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
> +
> +	for_each_cpu(cpu, isol_mask) {
> +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +		queue = (queue + 1) % qmap->nr_queues;
> +	}

Looks the IO hang issue in V3 isn't addressed yet, is it?

https://lore.kernel.org/linux-block/ZrtX4pzqwVUEgIPS@fedora/


Thanks,
Ming



* Re: [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2024-12-19  9:20   ` Ming Lei
@ 2024-12-19 15:38     ` Daniel Wagner
  2024-12-20  8:54       ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2024-12-19 15:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Michal Koutný,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On Thu, Dec 19, 2024 at 05:20:44PM +0800, Ming Lei wrote:
> > +	cpumask_andnot(isol_mask,
> > +		       cpu_possible_mask,
> > +		       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
> > +
> > +	for_each_cpu(cpu, isol_mask) {
> > +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> > +		queue = (queue + 1) % qmap->nr_queues;
> > +	}
> 
> Looks the IO hang issue in V3 isn't addressed yet, is it?
> 
> https://lore.kernel.org/linux-block/ZrtX4pzqwVUEgIPS@fedora/

I've added an explanation in the cover letter why this is not
addressed. From the cover letter:

I've experimented for a while and all solutions I came up were horrible
hacks (the hotpath needs to be touched) and I don't want to slow down all
other users (which are almost everyone). IMO, it's just not worth trying
to fix this corner case. If the user is using isolcpus and does CPU
hotplug, we can expect that the user can also first offline the isolated
CPUs. I've discussed this topic during ALPSS and the room came to the
same conclusion. Thus I just added a patch which issues a warning that
IOs are likely to hang.


* Re: [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2024-12-19 15:38     ` Daniel Wagner
@ 2024-12-20  8:54       ` Ming Lei
  2025-01-10  9:21         ` Daniel Wagner
  0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2024-12-20  8:54 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Michal Koutný,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On Thu, Dec 19, 2024 at 04:38:43PM +0100, Daniel Wagner wrote:

> When isolcpus=managed_irq is enabled all hardware queues should run on
> the housekeeping CPUs only. Thus ignore the affinity mask provided by
> the driver.

Compared with in-tree code, the above words are misleading.

- irq core code respects isolated CPUs by trying to exclude isolated
CPUs from effective masks

- blk-mq won't schedule blockd on isolated CPUs

If applications aren't run on isolated CPUs, IO interrupts usually won't
be triggered on isolated CPUs, so isolated CPUs are _not_ ignored.

> On Thu, Dec 19, 2024 at 05:20:44PM +0800, Ming Lei wrote:
> > > +	cpumask_andnot(isol_mask,
> > > +		       cpu_possible_mask,
> > > +		       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
> > > +
> > > +	for_each_cpu(cpu, isol_mask) {
> > > +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> > > +		queue = (queue + 1) % qmap->nr_queues;
> > > +	}
> > 
> > Looks the IO hang issue in V3 isn't addressed yet, is it?
> > 
> > https://lore.kernel.org/linux-block/ZrtX4pzqwVUEgIPS@fedora/
> 
> I've added an explanation in the cover letter why this is not
> addressed. From the cover letter:
> 
> I've experimented for a while and all solutions I came up were horrible
> hacks (the hotpath needs to be touched) and I don't want to slow down all
> other users (which are almost everyone). IMO, it's just not worth trying

IMO, this patchset is an improvement on the existing best-effort approach, which
works fine most of the time, so why do you think it slows down everyone?

> to fix this corner case. If the user is using isolcpus and does CPU
> hotplug, we can expect that the user can also first offline the isolated
> CPUs. I've discussed this topic during ALPSS and the room came to the
> same conclusion. Thus I just added a patch which issues a warning that
> IOs are likely to hang.

If the change needs userspace cooperation for using 'managed_irq', the exact
behavior needs to be documented in both this commit and Documentation/admin-guide/kernel-parameters.txt,
instead of the cover letter only.

But this patch does cause a regression for old applications which can't
follow the newly introduced rule:

	```
	If the user is using isolcpus and does CPU hotplug, we can expect that the
	user can also first offline the isolated CPUs.
	```

Thanks,
Ming



* Re: [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus
  2024-12-17 18:29 ` [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
  2024-12-19  6:28   ` Christoph Hellwig
@ 2024-12-20  9:04   ` Ming Lei
  1 sibling, 0 replies; 29+ messages in thread
From: Ming Lei @ 2024-12-20  9:04 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
	Martin K. Petersen, Nilesh Javali, GR-QLogic-Storage-Upstream,
	Don Brace, Michael S. Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Pérez, Xuan Zhuo, Andrew Morton,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization

On Tue, Dec 17, 2024 at 07:29:43PM +0100, Daniel Wagner wrote:
> When offlining a hardware context which also serves isolcpus mapped
> to it, any IO issued by the isolcpus will stall as there is nothing
> which handles the interrupts etc.
> 
> This configuration/setup is not supported at this point thus just issue
> a warning.

As I mentioned on patch 8, this IO hang is a regression for existing
applications which work just fine with 'isolcpus=managed_irq'.

Do you think the added warning will prevent people from complaining
about the regression? :-)


Thanks,
Ming



* Re: [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups
  2024-12-17 18:29 ` [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups Daniel Wagner
@ 2025-01-07  7:51   ` Hannes Reinecke
  2025-01-07  8:20     ` Daniel Wagner
  0 siblings, 1 reply; 29+ messages in thread
From: Hannes Reinecke @ 2025-01-07  7:51 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On 12/17/24 19:29, Daniel Wagner wrote:
> group_cpu_evenly might allocate fewer groups than requested:
> 
> group_cpu_evenly
>    __group_cpus_evenly
>      alloc_nodes_groups
>        # allocated total groups may be less than numgrps when
>        # active total CPU number is less than numgrps
> 
> In this case, the caller will do an out-of-bounds access because the
> caller assumes the returned masks array has numgrps entries.
> 
> Return the number of groups created so the caller can limit the access
> range accordingly.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq-cpumap.c        |  7 ++++---
>   drivers/virtio/virtio_vdpa.c |  2 +-
>   fs/fuse/virtio_fs.c          |  7 ++++---
>   include/linux/group_cpus.h   |  2 +-
>   kernel/irq/affinity.c        |  2 +-
>   lib/group_cpus.c             | 23 +++++++++++++----------
>   6 files changed, 24 insertions(+), 19 deletions(-)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index ad8d6a363f24ae11968b42f7bcfd6a719a0499b7..85c0a7073bd8bff5d34aad1729d45d89da4c4bd1 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -19,9 +19,10 @@
>   void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
>   {
>   	const struct cpumask *masks;
> -	unsigned int queue, cpu;
> +	unsigned int queue, cpu, nr_masks;
>   
> -	masks = group_cpus_evenly(qmap->nr_queues);
> +	nr_masks = qmap->nr_queues;
> +	masks = group_cpus_evenly(&nr_masks);

Hmph. I am a big fan of separating input and output parameters;
most ABI definitions will be doing that anyway.
Makes it also really hard to track whether the output parameters
had been set at all. Care to split it up?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups
  2025-01-07  7:51   ` Hannes Reinecke
@ 2025-01-07  8:20     ` Daniel Wagner
  2025-01-07 10:35       ` Hannes Reinecke
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-01-07  8:20 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Ming Lei,
	Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On Tue, Jan 07, 2025 at 08:51:57AM +0100, Hannes Reinecke wrote:
> >   void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
> >   {
> >   	const struct cpumask *masks;
> > -	unsigned int queue, cpu;
> > +	unsigned int queue, cpu, nr_masks;
> > -	masks = group_cpus_evenly(qmap->nr_queues);
> > +	nr_masks = qmap->nr_queues;
> > +	masks = group_cpus_evenly(&nr_masks);
> 
> Hmph. I am a big fan of separating input and output paramenters;
> most ABI definitions will be doing that anyway.
> Makes it also really hard to track whether the output parameters
> had been set at all. Care to split it up?

What API do you have in mind?


* Re: [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups
  2025-01-07  8:20     ` Daniel Wagner
@ 2025-01-07 10:35       ` Hannes Reinecke
  2025-01-07 10:46         ` Daniel Wagner
  0 siblings, 1 reply; 29+ messages in thread
From: Hannes Reinecke @ 2025-01-07 10:35 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Ming Lei,
	Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On 1/7/25 09:20, Daniel Wagner wrote:
> On Tue, Jan 07, 2025 at 08:51:57AM +0100, Hannes Reinecke wrote:
>>>    void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
>>>    {
>>>    	const struct cpumask *masks;
>>> -	unsigned int queue, cpu;
>>> +	unsigned int queue, cpu, nr_masks;
>>> -	masks = group_cpus_evenly(qmap->nr_queues);
>>> +	nr_masks = qmap->nr_queues;
>>> +	masks = group_cpus_evenly(&nr_masks);
>>
>> Hmph. I am a big fan of separating input and output parameters;
>> most ABI definitions will be doing that anyway.
>> Makes it also really hard to track whether the output parameters
>> had been set at all. Care to split it up?
> 
> What API do you have in mind?

ABI, not API.
x86 ABI has 'rdi' as the first input register, and 'rax' as the first 
output register. So there is no difference for the compiler whether
you have 'x(&a); b = a;' or 'x(a, &b)'.
But you gain on usability, as in the second case you can check whether
'b' had been set by the function.
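
For illustration only, a minimal sketch with a hypothetical helper x()
(the names are illustrative, not taken from the series):

	const struct cpumask *masks;

	/* style 1, in/out pointer: x(unsigned int *nr)
	 * after the call the caller cannot tell whether x() updated nr
	 * or simply left the input value in place */
	unsigned int nr = nr_queues;
	masks = x(&nr);

	/* style 2, separate output: x(unsigned int nr, unsigned int *out)
	 * out can be pre-initialized and checked after the call */
	unsigned int nr_out = 0;
	masks = x(nr_queues, &nr_out);
	if (!nr_out)
		return;		/* x() did not produce any groups */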

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups
  2025-01-07 10:35       ` Hannes Reinecke
@ 2025-01-07 10:46         ` Daniel Wagner
  2025-01-07 11:35           ` Hannes Reinecke
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-01-07 10:46 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Ming Lei,
	Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On Tue, Jan 07, 2025 at 11:35:10AM +0100, Hannes Reinecke wrote:
> On 1/7/25 09:20, Daniel Wagner wrote:
> > On Tue, Jan 07, 2025 at 08:51:57AM +0100, Hannes Reinecke wrote:
> > > >    void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
> > > >    {
> > > >    	const struct cpumask *masks;
> > > > -	unsigned int queue, cpu;
> > > > +	unsigned int queue, cpu, nr_masks;
> > > > -	masks = group_cpus_evenly(qmap->nr_queues);
> > > > +	nr_masks = qmap->nr_queues;
> > > > +	masks = group_cpus_evenly(&nr_masks);
> > > 
> > > Hmph. I am a big fan of separating input and output parameters;
> > > most ABI definitions will be doing that anyway.
> > > Makes it also really hard to track whether the output parameters
> > > had been set at all. Care to split it up?
> > 
> > What API do you have in mind?
> 
> ABI, not API.

I got that, still what C API do you want to see?


* Re: [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups
  2025-01-07 10:46         ` Daniel Wagner
@ 2025-01-07 11:35           ` Hannes Reinecke
  0 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2025-01-07 11:35 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Ming Lei,
	Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On 1/7/25 11:46, Daniel Wagner wrote:
> On Tue, Jan 07, 2025 at 11:35:10AM +0100, Hannes Reinecke wrote:
>> On 1/7/25 09:20, Daniel Wagner wrote:
>>> On Tue, Jan 07, 2025 at 08:51:57AM +0100, Hannes Reinecke wrote:
>>>>>     void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
>>>>>     {
>>>>>     	const struct cpumask *masks;
>>>>> -	unsigned int queue, cpu;
>>>>> +	unsigned int queue, cpu, nr_masks;
>>>>> -	masks = group_cpus_evenly(qmap->nr_queues);
>>>>> +	nr_masks = qmap->nr_queues;
>>>>> +	masks = group_cpus_evenly(&nr_masks);
>>>>
>>>> Hmph. I am a big fan of separating input and output parameters;
>>>> most ABI definitions will be doing that anyway.
>>>> Makes it also really hard to track whether the output parameters
>>>> had been set at all. Care to split it up?
>>>
>>> What API do you have in mind?
>>
>> ABI, not API.
> 
> I got that, still what C API do you want to see?
> 

masks = group_cpus_evenly(qmap->nr_queues, &nr_queues);

maybe?
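
Roughly, blk_mq_map_queues() would then look like this (a sketch only,
assuming the two-argument variant above; on failure it falls back to
mapping everything to the first queue):

	void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
	{
		const struct cpumask *masks;
		unsigned int queue, cpu, nr_masks = 0;

		masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
		if (!masks || !nr_masks) {
			/* map everything to the first queue */
			for_each_possible_cpu(cpu)
				qmap->mq_map[cpu] = qmap->queue_offset;
			return;
		}

		for (queue = 0; queue < qmap->nr_queues; queue++) {
			/* clamp to the number of masks actually created */
			for_each_cpu(cpu, &masks[queue % nr_masks])
				qmap->mq_map[cpu] = qmap->queue_offset + queue;
		}
		kfree(masks);
	}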

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v4 2/9] sched/isolation: document HK_TYPE housekeeping option
  2024-12-17 18:29 ` [PATCH v4 2/9] sched/isolation: document HK_TYPE housekeeping option Daniel Wagner
@ 2025-01-07 15:39   ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2025-01-07 15:39 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization


On 12/17/24 1:29 PM, Daniel Wagner wrote:
> The enum is a public API which can be used all over the kernel. This
> warrants a bit of documentation.
>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   include/linux/sched/isolation.h | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 2b461129d1fad0fd0ef1ad759fe44695dc635e8c..6649c3a48e0ea0a88c84bf5f2a74bff039fadaf2 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -6,6 +6,19 @@
>   #include <linux/init.h>
>   #include <linux/tick.h>
>   
> +/**
> + * enum hk_type - housekeeping cpu mask types
> + * @HK_TYPE_TIMER:	housekeeping cpu mask for timers
> + * @HK_TYPE_RCU:	housekeeping cpu mask for RCU
> + * @HK_TYPE_MISC:	housekeeping cpu mask for miscellaneous resources
> + * @HK_TYPE_SCHED:	housekeeping cpu mask for scheduling
> + * @HK_TYPE_TICK:	housekeeping cpu mask for timer tick
> + * @HK_TYPE_DOMAIN:	housekeeping cpu mask for general SMP balancing
> + *			and scheduling algorithms
> + * @HK_TYPE_WQ:		housekeeping cpu mask for workqueues
> + * @HK_TYPE_MANAGED_IRQ: housekeeping cpu mask for managed IRQs
> + * @HK_TYPE_KTHREAD:	housekeeping cpu mask for kthreads
> + */
>   enum hk_type {
>   	HK_TYPE_TIMER,
>   	HK_TYPE_RCU,
>
The various housekeeping types are in the process of being consolidated 
as most of them cannot be set independently. See commit 6010d245ddc9 
("sched/isolation: Consolidate housekeeping cpumasks that are always 
identical") in linux-next or tip. So this patch will have a conflict.

Cheers,
Longman




* Re: [PATCH v4 3/9] blk-mq: add number of queue calc helper
  2024-12-17 18:29 ` [PATCH v4 3/9] blk-mq: add number of queue calc helper Daniel Wagner
@ 2025-01-08  7:04   ` Hannes Reinecke
  0 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2025-01-08  7:04 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On 12/17/24 19:29, Daniel Wagner wrote:
> Multiqueue devices should only allocate queues for the housekeeping CPUs
> when isolcpus=managed_irq is set. This avoids disturbing the isolated CPUs
> with OS workload.
> 
> Add two helper variants which calculate the correct number of
> queues to use. Two variants are necessary
> because some drivers calculate their max number of queues based on the
> possible CPU mask, others based on the online CPU mask.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq-cpumap.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
>   include/linux/blk-mq.h |  2 ++
>   2 files changed, 47 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v4 4/9] nvme-pci: use block layer helpers to calculate num of queues
  2024-12-17 18:29 ` [PATCH v4 4/9] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
@ 2025-01-08  7:19   ` Hannes Reinecke
  0 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2025-01-08  7:19 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner
  Cc: Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Michal Koutný, Frederic Weisbecker, Mel Gorman,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On 12/17/24 19:29, Daniel Wagner wrote:
> Multiqueue devices should only allocate queues for the housekeeping CPUs
> when isolcpus=managed_irq is set. This avoids disturbing the isolated CPUs
> with OS workload.
> 
> Use helpers which calculate the correct number of queues to use when
> isolcpus is enabled.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/nvme/host/pci.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2024-12-20  8:54       ` Ming Lei
@ 2025-01-10  9:21         ` Daniel Wagner
  2025-01-11  3:31           ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-01-10  9:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Michal Koutný,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

Hi Ming,

On Fri, Dec 20, 2024 at 04:54:21PM +0800, Ming Lei wrote:
> On Thu, Dec 19, 2024 at 04:38:43PM +0100, Daniel Wagner wrote:
> 
> > When isolcpus=managed_irq is enabled all hardware queues should run on
> > the housekeeping CPUs only. Thus ignore the affinity mask provided by
> > the driver.
> 
> Compared with in-tree code, the above words are misleading.
> 
> - irq core code respects isolated CPUs by trying to exclude isolated
> CPUs from effective masks
> 
> - blk-mq won't schedule blockd on isolated CPUs

I see your point; the commit should highlight the fact that when an
application issues an I/O, this can lead to stalls.

What about a commit message like:

  When isolcpus=managed_irq is enabled, and the last housekeeping CPU for
  a given hardware context goes offline, there is no CPU left which
  handles the IOs anymore. If isolated CPUs mapped to this hardware
  context are online and an application running on these isolated CPUs
  issues an IO, this will lead to stalls.

  The kernel will not schedule IO to isolated CPUs, thus this avoids IO
  stalls.

  Thus issue a warning when housekeeping CPUs are offlined for a hardware
  context while there are still isolated CPUs online.

> If application aren't run on isolated CPUs, IO interrupt usually won't
> be triggered on isolated CPUs, so isolated CPUs are _not_ ignored.

FWIW, the 'usually' part is what made our customers nervous. They saw
some IRQ noise on the isolated CPUs with their workload and reported
that with these changes all was good. Unfortunately, we never got our
hands on the workload, so it's hard to say what was causing it.

> > On Thu, Dec 19, 2024 at 05:20:44PM +0800, Ming Lei wrote:
> > > > +	cpumask_andnot(isol_mask,
> > > > +		       cpu_possible_mask,
> > > > +		       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
> > > > +
> > > > +	for_each_cpu(cpu, isol_mask) {
> > > > +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> > > > +		queue = (queue + 1) % qmap->nr_queues;
> > > > +	}
> > > 
> > > Looks the IO hang issue in V3 isn't addressed yet, is it?
> > > 
> > > https://lore.kernel.org/linux-block/ZrtX4pzqwVUEgIPS@fedora/
> > 
> > I've added an explanation in the cover letter why this is not
> > addressed. From the cover letter:
> > 
> > I've experimented for a while and all solutions I came up were horrible
> > hacks (the hotpath needs to be touched) and I don't want to slow down all
> > other users (which are almost everyone). IMO, it's just not worth trying
> 
> IMO, this patchset is an improvement on the existing best-effort approach, which
> works fine most of the time, so why do you think it slows down everyone?

I was talking about implementing the feature which would remap the
isolated CPUs to an online hardware context when the current hardware
context goes offline. I didn't find a solution which I think would be
worth presenting. All involved some sort of locking/refcounting in the
hotpath, which I think we should just avoid.

> > to fix this corner case. If the user is using isolcpus and does CPU
> > hotplug, we can expect that the user can also first offline the isolated
> > CPUs. I've discussed this topic during ALPSS and the room came to the
> > same conclusion. Thus I just added a patch which issues a warning that
> > IOs are likely to hang.
> 
> If the change needs userspace cooperation for using 'managed_irq', the exact
> behavior needs to be documented in both this commit and Documentation/admin-guide/kernel-parameters.txt,
> instead of the cover letter only.
> 
> But this patch does cause a regression for old applications which can't
> follow the newly introduced rule:
> 
> 	```
> 	If the user is using isolcpus and does CPU hotplug, we can expect that the
> 	user can also first offline the isolated CPUs.
> 	```

Indeed, I forgot to update the documentation. I'll update it accordingly.

Thanks,
Daniel


* Re: [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2025-01-10  9:21         ` Daniel Wagner
@ 2025-01-11  3:31           ` Ming Lei
  2025-01-13 13:19             ` Daniel Wagner
  0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-01-11  3:31 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Michal Koutný,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

On Fri, Jan 10, 2025 at 10:21:49AM +0100, Daniel Wagner wrote:
> Hi Ming,
> 
> On Fri, Dec 20, 2024 at 04:54:21PM +0800, Ming Lei wrote:
> > On Thu, Dec 19, 2024 at 04:38:43PM +0100, Daniel Wagner wrote:
> > 
> > > When isolcpus=managed_irq is enabled all hardware queues should run on
> > > the housekeeping CPUs only. Thus ignore the affinity mask provided by
> > > the driver.
> > 
> > Compared with in-tree code, the above words are misleading.
> > 
> > - irq core code respects isolated CPUs by trying to exclude isolated
> > CPUs from effective masks
> > 
> > - blk-mq won't schedule blockd on isolated CPUs
> 
> I see your point, the commit should highlight the fact when an
> application is issuing an I/O, this can lead to stalls.
> 
> What about a commit message like:
> 
>   When isolcpus=managed_irq is enabled, and the last housekeeping CPU for
>   a given hardware context goes offline, there is no CPU left which
>   handles the IOs anymore. If isolated CPUs mapped to this hardware
>   context are online and an application running on these isolated CPUs
>   issues an IO, this will lead to stalls.

It isn't correct; the in-tree code doesn't have such a stall, no matter whether
IO is issued from HK or isolated CPUs, since the managed irq is guaranteed to
stay alive if any mapped CPU is online.

> 
>   The kernel will not schedule IO to isolated CPUs, thus this avoids IO
>   stalls.
> 
>   Thus issue a warning when housekeeping CPUs are offlined for a hardware
>   context while there are still isolated CPUs online.
> 
> If applications aren't run on isolated CPUs, IO interrupts usually won't
> > be triggered on isolated CPUs, so isolated CPUs are _not_ ignored.
> 
> FWIW, the 'usually' part is what made our customers nervous. They saw
> some IRQ noise on the isolated CPUs with their workload and reported
> that with these changes all was good. Unfortunately, we never got our
> hands on the workload, so it's hard to say what was causing it.

Please see irq_do_set_affinity():

        if (irqd_affinity_is_managed(data) &&
              housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
                  const struct cpumask *hk_mask;

                  hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);

                  cpumask_and(&tmp_mask, mask, hk_mask);
                  if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
                          prog_mask = mask;
                  else
                          prog_mask = &tmp_mask;
          } else {
                  prog_mask = mask;

The whole mask which may include isolated CPUs is only programmed to
hardware if there isn't any online CPU in `irq_mask & hk_mask`.

> 
> > > On Thu, Dec 19, 2024 at 05:20:44PM +0800, Ming Lei wrote:
> > > > > +	cpumask_andnot(isol_mask,
> > > > > +		       cpu_possible_mask,
> > > > > +		       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
> > > > > +
> > > > > +	for_each_cpu(cpu, isol_mask) {
> > > > > +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> > > > > +		queue = (queue + 1) % qmap->nr_queues;
> > > > > +	}
> > > > 
> > > > Looks the IO hang issue in V3 isn't addressed yet, is it?
> > > > 
> > > > https://lore.kernel.org/linux-block/ZrtX4pzqwVUEgIPS@fedora/
> > > 
> > > I've added an explanation in the cover letter why this is not
> > > addressed. From the cover letter:
> > > 
> > > I've experimented for a while and all solutions I came up were horrible
> > > hacks (the hotpath needs to be touched) and I don't want to slow down all
> > > other users (which are almost everyone). IMO, it's just not worth trying
> > 
> > IMO, this patchset is one improvement on existed best-effort approach, which
> > works fine most of times, so why you do think it slows down everyone?
> 
> I was talking about implementing the feature which would remap the
> isolated CPUs to online hardware context when the current hardware
> context goes offline. I didn't find a solution which I think would be
> worth presenting. All involved some sort of locking/refcounting in the
> hotpath, which I think we should just avoid.

I understand the trouble, but it is still an improvement from the user's
viewpoint rather than a new feature, since the interface of 'isolcpus=managed_irq'
isn't changed.

> 
> > > to fix this corner case. If the user is using isolcpus and does CPU
> > > hotplug, we can expect that the user can also first offline the isolated
> > > CPUs. I've discussed this topic during ALPSS and the room came to the
> > > same conclusion. Thus I just added a patch which issues a warning that
> > > IOs are likely to hang.
> > 
> > If the change need userspace cooperation for using 'managed_irq', the exact
> > behavior need to be documented in both this commit and Documentation/admin-guide/kernel-parameters.txt,
> > instead of cover-letter only.
> > 
> > But this patch does cause regression for old applications which can't
> > follow the new introduced rule:
> > 
> > 	```
> > 	If the user is using isolcpus and does CPU hotplug, we can expect that the
> > 	user can also first offline the isolated CPUs.
> > 	```
> 
> Indeed, I forgot to update the documentation. I'll update it accordingly.

It isn't a documentation thing; it breaks the no-regression policy, which crosses
our red line.

If you really want to move on, please add one new kernel command
line option and document the new usage, which requires applications to
offline the isolated CPUs first.


thanks,
Ming



* Re: [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2025-01-11  3:31           ` Ming Lei
@ 2025-01-13 13:19             ` Daniel Wagner
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-01-13 13:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Kashyap Desai, Sumit Saxena, Shivasharan S,
	Chandrakanth patil, Martin K. Petersen, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Don Brace, Michael S. Tsirkin,
	Jason Wang, Paolo Bonzini, Stefan Hajnoczi, Eugenio Pérez,
	Xuan Zhuo, Andrew Morton, Thomas Gleixner, Costa Shulyupin,
	Juri Lelli, Valentin Schneider, Waiman Long, Michal Koutný,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization

Hi Ming,

On Sat, Jan 11, 2025 at 11:31:10AM +0800, Ming Lei wrote:
> > What about a commit message like:
> > 
> >   When isolcpus=managed_irq is enabled, and the last housekeeping CPU for
> >   a given hardware context goes offline, there is no CPU left which
> >   handles the IOs anymore. If isolated CPUs mapped to this hardware
> >   context are online and an application running on these isolated CPUs
> >   issue an IO this will lead to stalls.
> 
> It isn't correct; the in-tree code doesn't have such a stall, no matter whether
> IO is issued from HK or isolated CPUs, since the managed irq is guaranteed to
> stay alive if any mapped CPU is online.

Yes, it has different properties.

> Please see irq_do_set_affinity():
> 
>         if (irqd_affinity_is_managed(data) &&
>               housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
>                   const struct cpumask *hk_mask;
> 
>                   hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
> 
>                   cpumask_and(&tmp_mask, mask, hk_mask);
>                   if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
>                           prog_mask = mask;
>                   else
>                           prog_mask = &tmp_mask;
>           } else {
>                   prog_mask = mask;
> 
> The whole mask which may include isolated CPUs is only programmed to
> hardware if there isn't any online CPU in `irq_mask & hk_mask`.

This is not what I am trying to achieve here. The main motivation of this
series is that isolated CPUs never serve IRQs.

> > I was talking about implementing the feature which would remap the
> > isolated CPUs to online hardware context when the current hardware
> > context goes offline. I didn't find a solution which I think would be
> > worth presenting. All involved some sort of locking/refcounting in the
> > hotpath, which I think we should just avoid.
> 
> I understand the trouble, but it is still an improvement from the user's
> viewpoint rather than a new feature, since the interface of 'isolcpus=managed_irq'
> isn't changed.

Ah, I misunderstood you. I didn't want to upset you. I thought you
were fine with changing how managed_irq works.

> > Indeed, I forgot to update the documentation. I'll update it accordingly.
> 
> It isn't a documentation thing; it breaks the no-regression policy, which crosses
> our red line.
> 
> If you really want to move on, please add one new kernel command
> line option and document the new usage, which requires applications to
> offline the isolated CPUs first.

Sure, I'll bring the separate command line option back.

Thanks,
Daniel


end of thread

Thread overview: 29+ messages
2024-12-17 18:29 [PATCH v4 0/9] blk: honor isolcpus configuration Daniel Wagner
2024-12-17 18:29 ` [PATCH v4 1/9] lib/group_cpus: let group_cpu_evenly return number of groups Daniel Wagner
2025-01-07  7:51   ` Hannes Reinecke
2025-01-07  8:20     ` Daniel Wagner
2025-01-07 10:35       ` Hannes Reinecke
2025-01-07 10:46         ` Daniel Wagner
2025-01-07 11:35           ` Hannes Reinecke
2024-12-17 18:29 ` [PATCH v4 2/9] sched/isolation: document HK_TYPE housekeeping option Daniel Wagner
2025-01-07 15:39   ` Waiman Long
2024-12-17 18:29 ` [PATCH v4 3/9] blk-mq: add number of queue calc helper Daniel Wagner
2025-01-08  7:04   ` Hannes Reinecke
2024-12-17 18:29 ` [PATCH v4 4/9] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
2025-01-08  7:19   ` Hannes Reinecke
2024-12-17 18:29 ` [PATCH v4 5/9] scsi: " Daniel Wagner
2024-12-17 18:29 ` [PATCH v4 6/9] virtio: blk/scsi: " Daniel Wagner
2024-12-19  6:25   ` Christoph Hellwig
2024-12-19  8:31   ` Michael S. Tsirkin
2024-12-17 18:29 ` [PATCH v4 7/9] lib/group_cpus: honor housekeeping config when grouping CPUs Daniel Wagner
2024-12-17 18:29 ` [PATCH v4 8/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
2024-12-19  6:26   ` Christoph Hellwig
2024-12-19  9:20   ` Ming Lei
2024-12-19 15:38     ` Daniel Wagner
2024-12-20  8:54       ` Ming Lei
2025-01-10  9:21         ` Daniel Wagner
2025-01-11  3:31           ` Ming Lei
2025-01-13 13:19             ` Daniel Wagner
2024-12-17 18:29 ` [PATCH v4 9/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
2024-12-19  6:28   ` Christoph Hellwig
2024-12-20  9:04   ` Ming Lei
