* [PATCH v5 0/9] blk: honor isolcpus configuration
@ 2025-01-10 16:26 Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 1/9] lib/group_cpus: let group_cpu_evenly return number initialized masks Daniel Wagner
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

I've split the parameter into an input and an output variable for
group_cpus_evenly as requested by Hannes.

The commit message of "lib/group_cpus: honor housekeeping config when
grouping CPUs" is hopefully no longer misleading. I've taken the liberty
to update the managed_irq documentation with the aim of reducing
confusion, as I got some feedback that the text was hard to understand.
I also added the warning about possible IO stalls when doing CPU
hotplug.

Daniel

(trimmed the To/Cc list, it was a bit large.)

Initial cover letter:

The nvme-pci driver is ignoring the isolcpus configuration. There were
several attempts to fix this in the past [1][2]. This is another
attempt, this time trying to address the feedback and solve it in the
core code.

The first patch introduces a new isolcpus option 'io_queue', but I'm not
really sure if this is needed; we could just use the managed_irq option
instead. I guess it depends on whether there is a use case which depends
on having queues on the isolated CPUs.

The second patch introduces a new block layer helper which returns the
number of possible queues. I suspect it would also make sense to make
this helper a bit smarter and consider the number of queues the
hardware supports.
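
As a rough sketch (illustration only; blk_mq_num_possible_queues() is
the helper added in the second patch, and max_hw_queues is a placeholder
for whatever limit the hardware reports):

        /* a max_hw_queues of 0 means no hardware limit; isolcpus is honored */
        nr_queues = blk_mq_num_possible_queues(max_hw_queues);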

And the last patch updates the group_cpus_evenly function so that it
uses only the housekeeping CPUs when they are defined.

Note this series is not addressing the affinity setting of the admin
queue (queue 0). I'd like to address this after we have agreed on how to
solve it. Currently, the admin queue affinity can be controlled by the
irqaffinity command line option, so there is at least a workaround for
it.
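
For instance (an illustration only, matching the
isolcpus=managed_irq,2-3,6-7 example used later in the series), booting
with

        irqaffinity=0-1,4-5

should keep the non-managed admin queue interrupt on the housekeeping
CPUs.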

Baseline:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1536 MB
node 0 free: 1227 MB
node 1 cpus: 4 5 6 7
node 1 size: 1729 MB
node 1 free: 1422 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

options nvme write_queues=4 poll_queues=4

55: 0 41 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
63: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4-5
64: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 6-7
65: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0-1
66: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 2-3
67: 0 0 0 0 24 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
68: 0 0 0 0 0 1 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
69: 0 0 0 0 0 0 41 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 6
70: 0 0 0 0 0 0 0 3 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 7
71: 1 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 9-edge nvme0q9 affinity: 0
72: 0 18 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 10-edge nvme0q10 affinity: 1
73: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 11-edge nvme0q11 affinity: 2
74: 0 0 0 3 0 0 0 0 PCI-MSIX-0000:00:05.0 12-edge nvme0q12 affinity: 3

queue mapping for /dev/nvme0n1
        hctx0: default 4 5
        hctx1: default 6 7
        hctx2: default 0 1
        hctx3: default 2 3
        hctx4: read 4
        hctx5: read 5
        hctx6: read 6
        hctx7: read 7
        hctx8: read 0
        hctx9: read 1
        hctx10: read 2
        hctx11: read 3
        hctx12: poll 4 5
        hctx13: poll 6 7
        hctx14: poll 0 1
        hctx15: poll 2 3

PCI name is 00:05.0: nvme0n1
        irq 55, cpu list 0-3, effective list 1
        irq 63, cpu list 4-5, effective list 5
        irq 64, cpu list 6-7, effective list 7
        irq 65, cpu list 0-1, effective list 1
        irq 66, cpu list 2-3, effective list 3
        irq 67, cpu list 4, effective list 4
        irq 68, cpu list 5, effective list 5
        irq 69, cpu list 6, effective list 6
        irq 70, cpu list 7, effective list 7
        irq 71, cpu list 0, effective list 0
        irq 72, cpu list 1, effective list 1
        irq 73, cpu list 2, effective list 2
        irq 74, cpu list 3, effective list 3

* patched:

48: 0 0 33 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
58: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4
59: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 5
60: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0
61: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 1
62: 0 0 0 0 45 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
63: 0 0 0 0 0 12 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
64: 2 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 0
65: 0 35 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 1

queue mapping for /dev/nvme0n1
        hctx0: default 2 3 4 6 7
        hctx1: default 5
        hctx2: default 0
        hctx3: default 1
        hctx4: read 4
        hctx5: read 5
        hctx6: read 0
        hctx7: read 1
        hctx8: poll 4
        hctx9: poll 5
        hctx10: poll 0
        hctx11: poll 1

PCI name is 00:05.0: nvme0n1
        irq 48, cpu list 0-3, effective list 2
        irq 58, cpu list 4, effective list 4
        irq 59, cpu list 5, effective list 5
        irq 60, cpu list 0, effective list 0
        irq 61, cpu list 1, effective list 1
        irq 62, cpu list 4, effective list 4
        irq 63, cpu list 5, effective list 5
        irq 64, cpu list 0, effective list 0
        irq 65, cpu list 1, effective list 1

[1] https://lore.kernel.org/lkml/20220423054331.GA17823@lst.de/T/#m9939195a465accbf83187caf346167c4242e798d
[2] https://lore.kernel.org/linux-nvme/87fruci5nj.ffs@tglx/
[3] https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
Changes in v5:
- rebased on latest for-6.14/block
- updated documentation on managed_irq
- updated commit message "blk-mq: issue warning when offlining hctx with online isolcpus"
- split input/output parameter in "lib/group_cpus: let group_cpu_evenly return number of groups"
- dropped "sched/isolation: document HK_TYPE housekeeping option"
- Link to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org

Changes in v4:
- added "blk-mq: issue warning when offlining hctx with online isolcpus"
- fixed check in group_cpus_evenly; the if condition needs to use
  housekeeping_enabled() and not cpumask_weight(housekeeping_mask),
  because the latter will always return a valid mask.
- dropped fixed tag from "lib/group_cpus.c: honor housekeeping config when
  grouping CPUs"
- fixed overlong line "scsi: use block layer helpers to calculate num
  of queues"
- dropped "sched/isolation: Add io_queue housekeeping option",
  just document the housekeep enum hk_type
- added "lib/group_cpus: let group_cpu_evenly return number of groups"
- collected tags
- split the series into a preparation series:
  https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Link to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de

Changes in v3:
- lifted a couple of patches from
  https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/
  "virtio: add APIs for retrieving vq affinity"
  "blk-mq: introduce blk_mq_dev_map_queues"
- replaced all users of blk_mq_[pci|virtio]_map_queues with
  blk_mq_dev_map_queues
- updated/extended number of queue calc helpers
- add isolcpus=io_queue CPU-hctx mapping function
- documented enum hk_type and isolcpus=io_queue
- added "scsi: pm8001: do not overwrite PCI queue mapping"
- Link to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes in v2:
- updated documentation
- split blk/nvme-pci patch
- dropped HK_TYPE_IO_QUEUE, use HK_TYPE_MANAGED_IRQ
- Link to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de

---
Daniel Wagner (9):
      lib/group_cpus: let group_cpu_evenly return number initialized masks
      blk-mq: add number of queue calc helper
      nvme-pci: use block layer helpers to calculate num of queues
      scsi: use block layer helpers to calculate num of queues
      virtio: blk/scsi: use block layer helpers to calculate num of queues
      lib/group_cpus: honor housekeeping config when grouping CPUs
      blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
      blk-mq: issue warning when offlining hctx with online isolcpus
      doc: update managed_irq documentation

 Documentation/admin-guide/kernel-parameters.txt |  46 +++++-----
 block/blk-mq-cpumap.c                           | 116 +++++++++++++++++++++++-
 block/blk-mq.c                                  |  43 ++++++++-
 drivers/block/virtio_blk.c                      |   5 +-
 drivers/nvme/host/pci.c                         |   5 +-
 drivers/scsi/megaraid/megaraid_sas_base.c       |  15 +--
 drivers/scsi/qla2xxx/qla_isr.c                  |  10 +-
 drivers/scsi/smartpqi/smartpqi_init.c           |   5 +-
 drivers/scsi/virtio_scsi.c                      |   1 +
 drivers/virtio/virtio_vdpa.c                    |   9 +-
 fs/fuse/virtio_fs.c                             |   6 +-
 include/linux/blk-mq.h                          |   2 +
 include/linux/group_cpus.h                      |   3 +-
 kernel/irq/affinity.c                           |   9 +-
 lib/group_cpus.c                                |  90 +++++++++++++++++-
 15 files changed, 304 insertions(+), 61 deletions(-)
---
base-commit: 844b8cdc681612ff24df62cdefddeab5772fadf1
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b

Best regards,
-- 
Daniel Wagner <wagi@kernel.org>



* [PATCH v5 1/9] lib/group_cpus: let group_cpu_evenly return number initialized masks
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-13  7:07   ` Hannes Reinecke
  2025-01-10 16:26 ` [PATCH v5 2/9] blk-mq: add number of queue calc helper Daniel Wagner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

group_cpus_evenly might allocate fewer groups than requested:

group_cpus_evenly
  __group_cpus_evenly
    alloc_nodes_groups
      # allocated total groups may be less than numgrps when
      # active total CPU number is less than numgrps

In this case, the caller will do an out-of-bounds access because it
assumes the returned array has numgrps masks.

Return the number of groups created so the caller can limit the access
range accordingly.
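
The resulting caller pattern looks like this (a condensed sketch of the
hunks below; use_mask() and fallback_mapping() are stand-ins for the
per-caller work):

        unsigned int nr_masks;
        struct cpumask *masks;

        masks = group_cpus_evenly(numgrps, &nr_masks);
        if (!masks)
                return fallback_mapping();

        for (queue = 0; queue < numgrps; queue++)
                /* nr_masks may be smaller than numgrps */
                use_mask(&masks[queue % nr_masks]);
        kfree(masks);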

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c        |  6 +++---
 drivers/virtio/virtio_vdpa.c |  9 +++++----
 fs/fuse/virtio_fs.c          |  6 +++---
 include/linux/group_cpus.h   |  3 ++-
 kernel/irq/affinity.c        |  9 +++++----
 lib/group_cpus.c             | 12 +++++++++---
 6 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index ad8d6a363f24ae11968b42f7bcfd6a719a0499b7..7d3dfe885dfac18711ae73eff510efe3877ffcb6 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -19,9 +19,9 @@
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
-	unsigned int queue, cpu;
+	unsigned int queue, cpu, nr_masks;
 
-	masks = group_cpus_evenly(qmap->nr_queues);
+	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
 	if (!masks) {
 		for_each_possible_cpu(cpu)
 			qmap->mq_map[cpu] = qmap->queue_offset;
@@ -29,7 +29,7 @@ void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 	}
 
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		for_each_cpu(cpu, &masks[queue])
+		for_each_cpu(cpu, &masks[queue % nr_masks])
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
 	kfree(masks);
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 1f60c9d5cb1810a6f208c24bb2ac640d537391a0..a7b297dae4890c9d6002744b90fc133bbedb7b44 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -329,20 +329,21 @@ create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
+		unsigned int nr_masks;
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks);
 
 		if (!result) {
 			kfree(masks);
 			return NULL;
 		}
 
-		for (j = 0; j < this_vecs; j++)
+		for (j = 0; j < nr_masks; j++)
 			cpumask_copy(&masks[curvec + j], &result[j]);
 		kfree(result);
 
-		curvec += this_vecs;
-		usedvecs += this_vecs;
+		curvec += nr_masks;
+		usedvecs += nr_masks;
 	}
 
 	/* Fill out vectors at the end that don't need affinity */
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 82afe78ec542358e2db6f4d955d521652ae363ec..47412bd40285a28d0dd61e4b3dabc59d5a1ba54e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -862,7 +862,7 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
 static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *fs)
 {
 	const struct cpumask *mask, *masks;
-	unsigned int q, cpu;
+	unsigned int q, cpu, nr_masks;
 
 	/* First attempt to map using existing transport layer affinities
 	 * e.g. PCIe MSI-X
@@ -882,7 +882,7 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f
 	return;
 fallback:
 	/* Attempt to map evenly in groups over the CPUs */
-	masks = group_cpus_evenly(fs->num_request_queues);
+	masks = group_cpus_evenly(fs->num_request_queues, &nr_masks);
 	/* If even this fails we default to all CPUs use first request queue */
 	if (!masks) {
 		for_each_possible_cpu(cpu)
@@ -891,7 +891,7 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f
 	}
 
 	for (q = 0; q < fs->num_request_queues; q++) {
-		for_each_cpu(cpu, &masks[q])
+		for_each_cpu(cpu, &masks[q % nr_masks])
 			fs->mq_map[cpu] = q + VQ_REQUEST;
 	}
 	kfree(masks);
diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index e42807ec61f6e8cf3787af7daa0d8686edfef0a3..bd5dada6e8606fa6cf8f7babf939e39fd7475c8d 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -9,6 +9,7 @@
 #include <linux/kernel.h>
 #include <linux/cpu.h>
 
-struct cpumask *group_cpus_evenly(unsigned int numgrps);
+struct cpumask *group_cpus_evenly(unsigned int numgrps,
+				  unsigned int *nummasks);
 
 #endif
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 44a4eba80315cc098ecfa366ca1d88483641b12a..d2aefab5eb2b929877ced43f48b6268098484bd7 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -70,20 +70,21 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	 */
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
+		unsigned int nr_masks;
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks);
 
 		if (!result) {
 			kfree(masks);
 			return NULL;
 		}
 
-		for (j = 0; j < this_vecs; j++)
+		for (j = 0; j < nr_masks; j++)
 			cpumask_copy(&masks[curvec + j].mask, &result[j]);
 		kfree(result);
 
-		curvec += this_vecs;
-		usedvecs += this_vecs;
+		curvec += nr_masks;
+		usedvecs += nr_masks;
 	}
 
 	/* Fill out vectors at the end that don't need affinity */
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc13907ce9f211f479615d2e3c9154..016c6578a07616959470b47121459a16a1bc99e5 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -332,9 +332,11 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 /**
  * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
  * @numgrps: number of groups
+ * @nummasks: number of initialized cpumasks
  *
  * Return: cpumask array if successful, NULL otherwise. And each element
- * includes CPUs assigned to this group
+ * includes CPUs assigned to this group. nummasks contains the number
+ * of initialized masks which can be less than numgrps.
  *
  * Try to put close CPUs from viewpoint of CPU and NUMA locality into
  * same group, and run two-stage grouping:
@@ -344,7 +346,8 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * We guarantee in the resulted grouping that all CPUs are covered, and
  * no same CPU is assigned to multiple groups
  */
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int numgrps,
+				  unsigned int *nummasks)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
@@ -421,10 +424,12 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 		kfree(masks);
 		return NULL;
 	}
+	*nummasks = nr_present + nr_others;
 	return masks;
 }
 #else /* CONFIG_SMP */
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int numgrps,
+				  unsigned int *nummasks)
 {
 	struct cpumask *masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
 
@@ -433,6 +438,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 
 	/* assign all CPUs(cpu 0) to the 1st group only */
 	cpumask_copy(&masks[0], cpu_possible_mask);
+	*nummasks = 1;
 	return masks;
 }
 #endif /* CONFIG_SMP */

-- 
2.47.1



* [PATCH v5 2/9] blk-mq: add number of queue calc helper
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 1/9] lib/group_cpus: let group_cpu_evenly return number initialized masks Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 3/9] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Add two helper variants which calculate the correct number of queues to
use. Two variants are necessary because some drivers calculate their
maximum number of queues based on the possible CPU mask, others based
on the online CPU mask.
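
For illustration (a sketch only, not lifted from an existing driver;
hw_max_queues and hw_max_vectors are placeholders for hardware-reported
limits):

        /* driver sizing its queues from the possible CPU mask */
        nr_io_queues = blk_mq_num_possible_queues(hw_max_queues);

        /* driver sizing its queues from the online CPU mask */
        nr_vectors = blk_mq_num_online_queues(hw_max_vectors);

In both cases a limit of 0 is ignored, and the housekeeping CPU mask
replaces the full mask when isolcpus=managed_irq is set.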

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 7d3dfe885dfac18711ae73eff510efe3877ffcb6..0923cccdcbcad75ad107c3636af15b723356e087 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -12,10 +12,55 @@
 #include <linux/cpu.h>
 #include <linux/group_cpus.h>
 #include <linux/device/bus.h>
+#include <linux/sched/isolation.h>
 
 #include "blk.h"
 #include "blk-mq.h"
 
+static unsigned int blk_mq_num_queues(const struct cpumask *mask,
+				      unsigned int max_queues)
+{
+	unsigned int num;
+
+	if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+		mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
+
+	num = cpumask_weight(mask);
+	return min_not_zero(num, max_queues);
+}
+
+/**
+ * blk_mq_num_possible_queues - Calc nr of queues for multiqueue devices
+ * @max_queues:	The maximal number of queues the hardware/driver
+ *		supports. If max_queues is 0, the argument is
+ *		ignored.
+ *
+ * Calculate the number of queues which should be used for a multiqueue
+ * device based on the number of possible CPUs. The helper takes the
+ * isolcpus settings into account.
+ */
+unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
+{
+	return blk_mq_num_queues(cpu_possible_mask, max_queues);
+}
+EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
+
+/**
+ * blk_mq_num_online_queues - Calc nr of queues for multiqueue devices
+ * @max_queues:	The maximal number of queues the hardware/driver
+ *		supports. If max_queues is 0, the argument is
+ *		ignored.
+ *
+ * Calculate the number of queues which should be used for a multiqueue
+ * device based on the number of online CPUs. The helper takes the
+ * isolcpus settings into account.
+ */
+unsigned int blk_mq_num_online_queues(unsigned int max_queues)
+{
+	return blk_mq_num_queues(cpu_online_mask, max_queues);
+}
+EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a0a9007cc1e36f89ebb21e699de3234a3cf9ef5b..f58172755e3464e305a86bbaa0a4509270fd7e0e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -909,6 +909,8 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 void blk_mq_unfreeze_queue_non_owner(struct request_queue *q);
 void blk_freeze_queue_start_non_owner(struct request_queue *q);
 
+unsigned int blk_mq_num_possible_queues(unsigned int max_queues);
+unsigned int blk_mq_num_online_queues(unsigned int max_queues);
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 			  struct device *dev, unsigned int offset);

-- 
2.47.1



* [PATCH v5 3/9] nvme-pci: use block layer helpers to calculate num of queues
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 1/9] lib/group_cpus: let group_cpu_evenly return number initialized masks Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 2/9] blk-mq: add number of queue calc helper Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 4/9] scsi: " Daniel Wagner
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Use the helpers which calculate the correct number of queues to use
when isolcpus is enabled.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 709328a67f915aede5c6bae956e1bdd5e6f3f1bc..4af22f09ed8474676edd118477344ed32236c497 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -81,7 +81,7 @@ static int io_queue_count_set(const char *val, const struct kernel_param *kp)
 	int ret;
 
 	ret = kstrtouint(val, 10, &n);
-	if (ret != 0 || n > num_possible_cpus())
+	if (ret != 0 || n > blk_mq_num_possible_queues(0))
 		return -EINVAL;
 	return param_set_uint(val, kp);
 }
@@ -2439,7 +2439,8 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return blk_mq_num_possible_queues(0) + dev->nr_write_queues +
+		dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)

-- 
2.47.1



* [PATCH v5 4/9] scsi: use block layer helpers to calculate num of queues
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (2 preceding siblings ...)
  2025-01-10 16:26 ` [PATCH v5 3/9] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-10 21:37   ` Martin K. Petersen
  2025-01-13  7:10   ` Hannes Reinecke
  2025-01-10 16:26 ` [PATCH v5 5/9] virtio: blk/scsi: " Daniel Wagner
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Use the helpers which calculate the correct number of queues to use
when isolcpus is enabled.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 15 +++++++++------
 drivers/scsi/qla2xxx/qla_isr.c            | 10 +++++-----
 drivers/scsi/smartpqi/smartpqi_init.c     |  5 ++---
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 49abd7dd75a7b7c1ddcfac41acecbbcf7de8f5a4..59d385e5a917979ae2f61f5db2c3355b9cab7e08 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5962,7 +5962,8 @@ megasas_alloc_irq_vectors(struct megasas_instance *instance)
 		else
 			instance->iopoll_q_count = 0;
 
-		num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+		num_msix_req = blk_mq_num_online_queues(0) +
+			instance->low_latency_index_start;
 		instance->msix_vectors = min(num_msix_req,
 				instance->msix_vectors);
 
@@ -5978,7 +5979,8 @@ megasas_alloc_irq_vectors(struct megasas_instance *instance)
 		/* Disable Balanced IOPS mode and try realloc vectors */
 		instance->perf_mode = MR_LATENCY_PERF_MODE;
 		instance->low_latency_index_start = 1;
-		num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+		num_msix_req = blk_mq_num_online_queues(0) +
+			instance->low_latency_index_start;
 
 		instance->msix_vectors = min(num_msix_req,
 				instance->msix_vectors);
@@ -6234,7 +6236,7 @@ static int megasas_init_fw(struct megasas_instance *instance)
 		intr_coalescing = (scratch_pad_1 & MR_INTR_COALESCING_SUPPORT_OFFSET) ?
 								true : false;
 		if (intr_coalescing &&
-			(num_online_cpus() >= MR_HIGH_IOPS_QUEUE_COUNT) &&
+			(blk_mq_num_online_queues(0) >= MR_HIGH_IOPS_QUEUE_COUNT) &&
 			(instance->msix_vectors == MEGASAS_MAX_MSIX_QUEUES))
 			instance->perf_mode = MR_BALANCED_PERF_MODE;
 		else
@@ -6278,7 +6280,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
 		else
 			instance->low_latency_index_start = 1;
 
-		num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+		num_msix_req = blk_mq_num_online_queues(0) +
+			instance->low_latency_index_start;
 
 		instance->msix_vectors = min(num_msix_req,
 				instance->msix_vectors);
@@ -6310,8 +6313,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
 	megasas_setup_reply_map(instance);
 
 	dev_info(&instance->pdev->dev,
-		"current msix/online cpus\t: (%d/%d)\n",
-		instance->msix_vectors, (unsigned int)num_online_cpus());
+		"current msix/max num queues\t: (%d/%u)\n",
+		instance->msix_vectors, blk_mq_num_online_queues(0));
 	dev_info(&instance->pdev->dev,
 		"RDPQ mode\t: (%s)\n", instance->is_rdpq ? "enabled" : "disabled");
 
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index fe98c76e9be32ff03a1960f366f0d700d1168383..c4c6b5c6658c0734f7ff68bcc31b33dde87296dd 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -4533,13 +4533,13 @@ qla24xx_enable_msix(struct qla_hw_data *ha, struct rsp_que *rsp)
 	if (USER_CTRL_IRQ(ha) || !ha->mqiobase) {
 		/* user wants to control IRQ setting for target mode */
 		ret = pci_alloc_irq_vectors(ha->pdev, min_vecs,
-		    min((u16)ha->msix_count, (u16)(num_online_cpus() + min_vecs)),
-		    PCI_IRQ_MSIX);
+			blk_mq_num_online_queues(ha->msix_count) + min_vecs,
+			PCI_IRQ_MSIX);
 	} else
 		ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs,
-		    min((u16)ha->msix_count, (u16)(num_online_cpus() + min_vecs)),
-		    PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
-		    &desc);
+			blk_mq_num_online_queues(ha->msix_count) + min_vecs,
+			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+			&desc);
 
 	if (ret < 0) {
 		ql_log(ql_log_fatal, vha, 0x00c7,
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 04fb24d77e9b5c0137f26bc41f17191cc4c49728..7636c8d1c9f14a0d887c1d517c3664f0d0df7e6e 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -5278,15 +5278,14 @@ static void pqi_calculate_queue_resources(struct pqi_ctrl_info *ctrl_info)
 	if (reset_devices) {
 		num_queue_groups = 1;
 	} else {
-		int num_cpus;
 		int max_queue_groups;
 
 		max_queue_groups = min(ctrl_info->max_inbound_queues / 2,
 			ctrl_info->max_outbound_queues - 1);
 		max_queue_groups = min(max_queue_groups, PQI_MAX_QUEUE_GROUPS);
 
-		num_cpus = num_online_cpus();
-		num_queue_groups = min(num_cpus, ctrl_info->max_msix_vectors);
+		num_queue_groups =
+			blk_mq_num_online_queues(ctrl_info->max_msix_vectors);
 		num_queue_groups = min(num_queue_groups, max_queue_groups);
 	}
 

-- 
2.47.1



* [PATCH v5 5/9] virtio: blk/scsi: use block layer helpers to calculate num of queues
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (3 preceding siblings ...)
  2025-01-10 16:26 ` [PATCH v5 4/9] scsi: " Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-13  7:11   ` Hannes Reinecke
  2025-01-10 16:26 ` [PATCH v5 6/9] lib/group_cpus: honor housekeeping config when grouping CPUs Daniel Wagner
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=managed_irq is set. This avoids disturbing the isolated
CPUs with OS workload.

Use the helpers which calculate the correct number of queues to use
when isolcpus is enabled.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/block/virtio_blk.c | 5 ++---
 drivers/scsi/virtio_scsi.c | 1 +
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 71a7ffeafb32ccd6329102d3166da7cbc8bc9539..c5b2ceebd645659d86299d07224d85bb7671a9a7 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -976,9 +976,8 @@ static int init_vq(struct virtio_blk *vblk)
 		return -EINVAL;
 	}
 
-	num_vqs = min_t(unsigned int,
-			min_not_zero(num_request_queues, nr_cpu_ids),
-			num_vqs);
+	num_vqs = blk_mq_num_possible_queues(
+			min_not_zero(num_request_queues, num_vqs));
 
 	num_poll_vqs = min_t(unsigned int, poll_queues, num_vqs - 1);
 
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 60be1a0c61836ba643adcf9ad8d5b68563a86cb1..46ca0b82f57ce2211c7e2817dd40ee34e65bcbf9 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -919,6 +919,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	/* We need to know how many queues before we allocate. */
 	num_queues = virtscsi_config_get(vdev, num_queues) ? : 1;
 	num_queues = min_t(unsigned int, nr_cpu_ids, num_queues);
+	num_queues = blk_mq_num_possible_queues(num_queues);
 
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
 

-- 
2.47.1



* [PATCH v5 6/9] lib/group_cpus: honor housekeeping config when grouping CPUs
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (4 preceding siblings ...)
  2025-01-10 16:26 ` [PATCH v5 5/9] virtio: blk/scsi: " Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 7/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

group_cpus_evenly distributes all present CPUs into groups. This
ignores the isolcpus configuration and assigns isolated CPUs to the
groups as well.

Make group_cpus_evenly aware of the isolcpus configuration and use the
housekeeping CPU mask as the base for distributing the available CPUs
into groups.
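
For example, on the 8-CPU test system from the cover letter booted with
isolcpus=managed_irq,2-3,6-7, the housekeeping mask is 0-1,4-5, so a
call like

        masks = group_cpus_evenly(4, &nr_masks);

distributes only CPUs 0, 1, 4 and 5 across the four groups; the isolated
CPUs no longer show up in any mask.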

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 lib/group_cpus.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 016c6578a07616959470b47121459a16a1bc99e5..ba112dda527552a031dff083e77b748ac2629ca8 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -8,6 +8,7 @@
 #include <linux/cpu.h>
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 #ifdef CONFIG_SMP
 
@@ -330,7 +331,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 }
 
 /**
- * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * group_possible_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
  * @numgrps: number of groups
  * @nummasks: number of initialized cpumasks
  *
@@ -346,8 +347,8 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * We guarantee in the resulted grouping that all CPUs are covered, and
  * no same CPU is assigned to multiple groups
  */
-struct cpumask *group_cpus_evenly(unsigned int numgrps,
-				  unsigned int *nummasks)
+static struct cpumask *group_possible_cpus_evenly(unsigned int numgrps,
+						  unsigned int *nummasks)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
@@ -427,6 +428,81 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps,
 	*nummasks = nr_present + nr_others;
 	return masks;
 }
+
+/**
+ * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ * @cpu_mask: CPUs to consider for the grouping
+ * @nummasks: number of initialized cpumasks
+ *
+ * Return: cpumask array if successful, NULL otherwise. And each element
+ * includes CPUs assigned to this group.
+ *
+ * Try to put close CPUs from viewpoint of CPU and NUMA locality into
+ * same group. Allocate present CPUs on these groups evenly.
+ */
+static struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+					      const struct cpumask *cpu_mask,
+					      unsigned int *nummasks)
+{
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_nmsk;
+
+	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	build_node_to_cpumask(node_to_cpumask);
+
+	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, cpu_mask, nmsk,
+				  masks);
+
+fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	*nummasks = ret;
+	return masks;
+}
+
+/**
+ * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ * @nummasks: number of initialized cpumasks
+ *
+ * Return: cpumask array if successful, NULL otherwise.
+ *
+ * group_possible_cpus_evenly() is used for distributing the groups on
+ * all possible CPUs in the absence of the isolcpus command line argument.
+ * group_mask_cpus_evenly() is used when the isolcpus command line
+ * argument is used with the managed_irq option. In this case only the
+ * housekeeping CPUs are considered.
+ */
+struct cpumask *group_cpus_evenly(unsigned int numgrps,
+				  unsigned int *nummasks)
+{
+	if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
+		return group_mask_cpus_evenly(numgrps,
+				housekeeping_cpumask(HK_TYPE_MANAGED_IRQ),
+				nummasks);
+	}
+
+	return group_possible_cpus_evenly(numgrps, nummasks);
+}
 #else /* CONFIG_SMP */
 struct cpumask *group_cpus_evenly(unsigned int numgrps,
 				  unsigned int *nummasks)

-- 
2.47.1



* [PATCH v5 7/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (5 preceding siblings ...)
  2025-01-10 16:26 ` [PATCH v5 6/9] lib/group_cpus: honor housekeeping config when grouping CPUs Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-13  7:13   ` Hannes Reinecke
  2025-01-10 16:26 ` [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
  2025-01-10 16:26 ` [PATCH v5 9/9] doc: update managed_irq documentation Daniel Wagner
  8 siblings, 1 reply; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

When isolcpus=managed_irq is enabled, all hardware queues should run on
the housekeeping CPUs only. Thus ignore the affinity mask provided by
the driver. Also, we can't use blk_mq_map_queues because it maps all
CPUs to the first hctx unless a CPU matches the one the hctx has its
affinity set to, e.g. with 8 CPUs and an isolcpus=managed_irq,2-3,6-7
config:

  queue mapping for /dev/nvme0n1
        hctx0: default 2 3 4 6 7
        hctx1: default 5
        hctx2: default 0
        hctx3: default 1

  PCI name is 00:05.0: nvme0n1
        irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
        irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
        irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
        irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
        irq 61 affinity 1 effective 1 is_managed:1 nvme0q4

whereas with blk_mq_map_hk_queues we get:

  queue mapping for /dev/nvme0n1
        hctx0: default 2 4
        hctx1: default 3 5
        hctx2: default 0 6
        hctx3: default 1 7

  PCI name is 00:05.0: nvme0n1
        irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
        irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
        irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
        irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
        irq 64 affinity 1 effective 1 is_managed:1 nvme0q4

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 0923cccdcbcad75ad107c3636af15b723356e087..e78eebbbaf0a2e0e8e03a2b31087c62a9090808c 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -61,11 +61,73 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
 
+/*
+ * blk_mq_map_hk_queues - Create housekeeping CPU to hardware queue mapping
+ * @qmap:	CPU to hardware queue map
+ *
+ * Create a housekeeping CPU to hardware queue mapping in @qmap. If the
+ * isolcpus feature is enabled and blk_mq_map_hk_queues returns true,
+ * @qmap contains a valid configuration honoring the managed_irq
+ * configuration. If the isolcpus feature is disabled this function
+ * returns false.
+ */
+static bool blk_mq_map_hk_queues(struct blk_mq_queue_map *qmap)
+{
+	struct cpumask *hk_masks;
+	cpumask_var_t isol_mask;
+	unsigned int queue, cpu, nr_masks;
+
+	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+		return false;
+
+	/* map housekeeping cpus to matching hardware context */
+	hk_masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
+	if (!hk_masks)
+		goto fallback;
+
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
+		for_each_cpu(cpu, &hk_masks[queue % nr_masks])
+			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+	}
+
+	kfree(hk_masks);
+
+	/* map isolcpus to hardware context */
+	if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
+		goto fallback;
+
+	queue = 0;
+	cpumask_andnot(isol_mask,
+		       cpu_possible_mask,
+		       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+
+	for_each_cpu(cpu, isol_mask) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = (queue + 1) % qmap->nr_queues;
+	}
+
+	free_cpumask_var(isol_mask);
+
+	return true;
+
+fallback:
+	/* map all cpus to hardware context ignoring any affinity */
+	queue = 0;
+	for_each_possible_cpu(cpu) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = (queue + 1) % qmap->nr_queues;
+	}
+	return true;
+}
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
 	unsigned int queue, cpu, nr_masks;
 
+	if (blk_mq_map_hk_queues(qmap))
+		return;
+
 	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
 	if (!masks) {
 		for_each_possible_cpu(cpu)
@@ -120,6 +182,9 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 	if (!dev->bus->irq_get_affinity)
 		goto fallback;
 
+	if (blk_mq_map_hk_queues(qmap))
+		return;
+
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
 		mask = dev->bus->irq_get_affinity(dev, queue + offset);
 		if (!mask)

-- 
2.47.1



* [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (6 preceding siblings ...)
  2025-01-10 16:26 ` [PATCH v5 7/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-11  3:40   ` Ming Lei
  2025-01-13  7:14   ` Hannes Reinecke
  2025-01-10 16:26 ` [PATCH v5 9/9] doc: update managed_irq documentation Daniel Wagner
  8 siblings, 2 replies; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

When isolcpus=managed_irq is enabled and the last housekeeping CPU for a
given hardware context goes offline, there is no CPU left which handles
the IOs anymore. If isolated CPUs mapped to this hardware context are
still online and an application running on these isolated CPUs issues an
IO, this will lead to stalls.

The kernel itself will not schedule IO on isolated CPUs, so only IO
issued from the isolated CPUs is affected.

Thus issue a warning when housekeeping CPUs are offlined for a hardware
context while isolated CPUs mapped to it are still online.
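
As a concrete scenario, reusing the isolcpus=managed_irq,2-3,6-7 mapping
from the previous patch where hctx2 serves CPUs 0 and 6: offlining
housekeeping CPU 0 while isolated CPU 6 is still online leaves hctx2
without a CPU that is allowed to handle its IRQ work, and the new check
prints roughly:

        nvme0n1: offlining hctx2 but there is still an online isolcpu CPU 6 mapped to it, IO stalls expected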

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2e6132f778fd958aae3cad545e4b3dd623c9c304..43eab0db776d37ffd7eb6c084211b5e05d41a574 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3620,6 +3620,45 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	return data.has_rq;
 }
 
+static void blk_mq_hctx_check_isolcpus_online(struct blk_mq_hw_ctx *hctx, unsigned int cpu)
+{
+	const struct cpumask *hk_mask;
+	int i;
+
+	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+		return;
+
+	hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
+
+	for (i = 0; i < hctx->nr_ctx; i++) {
+		struct blk_mq_ctx *ctx = hctx->ctxs[i];
+
+		if (ctx->cpu == cpu)
+			continue;
+
+		/*
+		 * Check if this context has at least one online
+		 * housekeeping CPU in this case the hardware context is
+		 * usable.
+		 */
+		if (cpumask_test_cpu(ctx->cpu, hk_mask) &&
+		    cpu_online(ctx->cpu))
+			break;
+
+		/*
+		 * The context doesn't have any online housekeeping CPUs
+		 * but there might be an online isolated CPU mapped to
+		 * it.
+		 */
+		if (cpu_is_offline(ctx->cpu))
+			continue;
+
+		pr_warn("%s: offlining hctx%d but there is still an online isolcpu CPU %d mapped to it, IO stalls expected\n",
+			hctx->queue->disk->disk_name,
+			hctx->queue_num, ctx->cpu);
+	}
+}
+
 static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int this_cpu)
 {
@@ -3639,8 +3678,10 @@ static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 			continue;
 
 		/* this hctx has at least one online CPU */
-		if (this_cpu != cpu)
+		if (this_cpu != cpu) {
+			blk_mq_hctx_check_isolcpus_online(hctx, this_cpu);
 			return true;
+		}
 	}
 
 	return false;

-- 
2.47.1



* [PATCH v5 9/9] doc: update managed_irq documentation
  2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
                   ` (7 preceding siblings ...)
  2025-01-10 16:26 ` [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
@ 2025-01-10 16:26 ` Daniel Wagner
  2025-01-13  7:14   ` Hannes Reinecke
  8 siblings, 1 reply; 18+ messages in thread
From: Daniel Wagner @ 2025-01-10 16:26 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

The managed_irq documentation is a bit difficult to understand.
Rephrase the current text and document the latest changes to how
managed_irq CPU sets are handled: isolated CPUs and housekeeping CPUs
are grouped into sets, and stalls are possible if all housekeeping CPUs
in a set are offlined.
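
For example, with isolcpus=managed_irq,2-3,6-7 the new text asks users
to offline the isolated CPUs of a set before its housekeeping CPUs, e.g.
(a sketch using the standard sysfs hotplug interface):

        # offline the isolated CPUs of the set first ...
        echo 0 > /sys/devices/system/cpu/cpu6/online
        # ... only then the housekeeping CPU serving the same set
        echo 0 > /sys/devices/system/cpu/cpu0/online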

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt | 46 +++++++++++++------------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3872bc6ec49d63772755504966ae70113f24a1db..e4bf1fc984943c1d4938dffb85d97da05010a325 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2460,28 +2460,30 @@
 			  "number of CPUs in system - 1".
 
 			managed_irq
-
-			  Isolate from being targeted by managed interrupts
-			  which have an interrupt mask containing isolated
-			  CPUs. The affinity of managed interrupts is
-			  handled by the kernel and cannot be changed via
-			  the /proc/irq/* interfaces.
-
-			  This isolation is best effort and only effective
-			  if the automatically assigned interrupt mask of a
-			  device queue contains isolated and housekeeping
-			  CPUs. If housekeeping CPUs are online then such
-			  interrupts are directed to the housekeeping CPU
-			  so that IO submitted on the housekeeping CPU
-			  cannot disturb the isolated CPU.
-
-			  If a queue's affinity mask contains only isolated
-			  CPUs then this parameter has no effect on the
-			  interrupt routing decision, though interrupts are
-			  only delivered when tasks running on those
-			  isolated CPUs submit IO. IO submitted on
-			  housekeeping CPUs has no influence on those
-			  queues.
+			  Isolate CPUs from IRQ-related work for drivers
+			  that support managed interrupts, ensuring no
+			  IRQ work is scheduled on the isolated CPUs. The
+			  kernel manages the affinity of managed
+			  interrupts, which cannot be changed via the
+			  /proc/irq/* interfaces.
+
+			  Since isolated CPUs do not handle IRQ work, the
+			  work is forwarded to housekeeping CPUs.
+			  Housekeeping and isolated CPUs are grouped into
+			  sets, ensuring at least one housekeeping CPU is
+			  available per set. Consequently, if all
+			  housekeeping CPUs in a set are offlined, there
+			  will be no CPU available to handle IRQ work for
+			  the isolated CPUs. Therefore, users should
+			  offline all isolated CPUs before offlining the
+			  housekeeping CPUs in a set to avoid stalls.
+
+			  The block layer ensures that no I/O is
+			  scheduled on isolated CPUs, except when user
+			  applications running on the isolated CPUs issue
+			  I/O requests. In this case the I/O is issued
+			  from the isolated CPU and the IRQ-related work
+			  is forwarded to a housekeeping CPU.
 
 			The format of <cpu-list> is described above.
 

-- 
2.47.1



* Re: [PATCH v5 4/9] scsi: use block layer helpers to calculate num of queues
  2025-01-10 16:26 ` [PATCH v5 4/9] scsi: " Daniel Wagner
@ 2025-01-10 21:37   ` Martin K. Petersen
  2025-01-13  7:10   ` Hannes Reinecke
  1 sibling, 0 replies; 18+ messages in thread
From: Martin K. Petersen @ 2025-01-10 21:37 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream


Daniel,

> Multiqueue devices should only allocate queues for the housekeeping CPUs
> when isolcpus=managed_irq is set. This avoids disturbing the isolated
> CPUs with OS workload.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus
  2025-01-10 16:26 ` [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
@ 2025-01-11  3:40   ` Ming Lei
  2025-01-13  7:14   ` Hannes Reinecke
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2025-01-11  3:40 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, GR-QLogic-Storage-Upstream

Hi Daniel,

On Fri, Jan 10, 2025 at 05:26:46PM +0100, Daniel Wagner wrote:
> When isolcpus=managed_irq is enabled and the last housekeeping CPU for a
> given hardware context goes offline, there is no CPU left which handles
> the IOs anymore. If isolated CPUs mapped to this hardware context are
> still online and an application running on these isolated CPUs issues an
> IO, this will lead to stalls.
> 
> The kernel itself will not schedule IO on isolated CPUs, so only IO
> issued from the isolated CPUs is affected.
> 
> Thus issue a warning when housekeeping CPUs are offlined for a hardware
> context while isolated CPUs mapped to it are still online.

Why do you continue to send patches without addressing the fundamental
regression?

This patchset does break existing applications which can't follow the
new rule of offlining CPUs in order.

Again, it violates the no-regression rule of kernel development.


Thanks,
Ming



* Re: [PATCH v5 1/9] lib/group_cpus: let group_cpu_evenly return number initialized masks
  2025-01-10 16:26 ` [PATCH v5 1/9] lib/group_cpus: let group_cpu_evenly return number initialized masks Daniel Wagner
@ 2025-01-13  7:07   ` Hannes Reinecke
  0 siblings, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2025-01-13  7:07 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 1/10/25 17:26, Daniel Wagner wrote:
> group_cpus_evenly might allocate fewer groups than requested:
> 
> group_cpus_evenly
>    __group_cpus_evenly
>      alloc_nodes_groups
>        # allocated total groups may be less than numgrps when
>        # active total CPU number is less than numgrps
> 
> In this case, the caller will do an out-of-bounds access because it
> assumes the returned array has numgrps masks.
> 
> Return the number of groups created so the caller can limit the access
> range accordingly.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq-cpumap.c        |  6 +++---
>   drivers/virtio/virtio_vdpa.c |  9 +++++----
>   fs/fuse/virtio_fs.c          |  6 +++---
>   include/linux/group_cpus.h   |  3 ++-
>   kernel/irq/affinity.c        |  9 +++++----
>   lib/group_cpus.c             | 12 +++++++++---
>   6 files changed, 27 insertions(+), 18 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v5 4/9] scsi: use block layer helpers to calculate num of queues
  2025-01-10 16:26 ` [PATCH v5 4/9] scsi: " Daniel Wagner
  2025-01-10 21:37   ` Martin K. Petersen
@ 2025-01-13  7:10   ` Hannes Reinecke
  1 sibling, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2025-01-13  7:10 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 1/10/25 17:26, Daniel Wagner wrote:
> Multiqueue devices should only allocate queues for the housekeeping
> CPUs when isolcpus=managed_irq is set. This avoids disturbing the
> isolated CPUs with OS workload.
> 
> Use the helpers which calculate the correct number of queues to use
> when isolcpus is enabled.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/scsi/megaraid/megaraid_sas_base.c | 15 +++++++++------
>   drivers/scsi/qla2xxx/qla_isr.c            | 10 +++++-----
>   drivers/scsi/smartpqi/smartpqi_init.c     |  5 ++---
>   3 files changed, 16 insertions(+), 14 deletions(-)
> 
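The conversion has the same shape in all three drivers; a minimal
sketch, assuming the helper added in patch 2/9 is named
blk_mq_num_online_queues() (the variable names here are illustrative,
not the actual driver code):

  /* Before: queue count derived directly from the CPU count,
   * which includes isolated CPUs. */
  nr_hw_queues = min(max_msix_vectors, num_online_cpus());

  /* After: the block layer helper caps the count at the number
   * of housekeeping CPUs when isolcpus=managed_irq is set. */
  nr_hw_queues = blk_mq_num_online_queues(max_msix_vectors);
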
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v5 5/9] virtio: blk/scsi: use block layer helpers to calculate num of queues
  2025-01-10 16:26 ` [PATCH v5 5/9] virtio: blk/scsi: " Daniel Wagner
@ 2025-01-13  7:11   ` Hannes Reinecke
  0 siblings, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2025-01-13  7:11 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 1/10/25 17:26, Daniel Wagner wrote:
> Multiqueue devices should only allocate queues for the housekeeping
> CPUs when isolcpus=managed_irq is set. This avoids disturbing the
> isolated CPUs with OS workload.
> 
> Use the helpers which calculate the correct number of queues to use
> when isolcpus is enabled.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/block/virtio_blk.c | 5 ++---
>   drivers/scsi/virtio_scsi.c | 1 +
>   2 files changed, 3 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v5 7/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled
  2025-01-10 16:26 ` [PATCH v5 7/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
@ 2025-01-13  7:13   ` Hannes Reinecke
  0 siblings, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2025-01-13  7:13 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 1/10/25 17:26, Daniel Wagner wrote:
> When isolcpus=managed_irq is enabled, all hardware queues should run on
> the housekeeping CPUs only. Thus ignore the affinity mask provided by
> the driver. Also we can't use blk_mq_map_queues because it maps all
> CPUs to the first hctx unless a CPU happens to match the affinity of a
> hctx, e.g. with 8 CPUs and an isolcpus=managed_irq,2-3,6-7 config:
> 
>    queue mapping for /dev/nvme0n1
>          hctx0: default 2 3 4 6 7
>          hctx1: default 5
>          hctx2: default 0
>          hctx3: default 1
> 
>    PCI name is 00:05.0: nvme0n1
>          irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
>          irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
>          irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
>          irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
>          irq 61 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> whereas with blk_mq_hk_map_queues we get
> 
>    queue mapping for /dev/nvme0n1
>          hctx0: default 2 4
>          hctx1: default 3 5
>          hctx2: default 0 6
>          hctx3: default 1 7
> 
>    PCI name is 00:05.0: nvme0n1
>          irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
>          irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
>          irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
>          irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
>          irq 64 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq-cpumap.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 65 insertions(+)
> 
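For the gist of blk_mq_hk_map_queues without the patch at hand, a
simplified sketch of the mapping idea (the real code also has to respect
NUMA locality; this two-pass loop is illustrative only):

  const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
  unsigned int cpu, nr = 0;

  /* First spread the housekeeping CPUs across all hctxs ... */
  for_each_cpu(cpu, hk)
          qmap->mq_map[cpu] = qmap->queue_offset +
                              (nr++ % qmap->nr_queues);

  /* ... then fold the isolated CPUs on top, so every hctx keeps a
   * housekeeping CPU and no CPU points at an unserved queue. */
  for_each_cpu_andnot(cpu, cpu_possible_mask, hk)
          qmap->mq_map[cpu] = qmap->queue_offset +
                              (nr++ % qmap->nr_queues);
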
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus
  2025-01-10 16:26 ` [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
  2025-01-11  3:40   ` Ming Lei
@ 2025-01-13  7:14   ` Hannes Reinecke
  1 sibling, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2025-01-13  7:14 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 1/10/25 17:26, Daniel Wagner wrote:
> When isolcpus=managed_irq is enabled and the last housekeeping CPU for
> a given hardware context goes offline, no CPU is left to handle its
> IOs. Since the kernel will not schedule IO work on isolated CPUs, any
> application running on the still-online isolated CPUs mapped to this
> hardware context will stall as soon as it issues IO.
> 
> Thus issue a warning when the housekeeping CPUs of a hardware context
> are offlined while isolated CPUs mapped to it are still online.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 42 insertions(+), 1 deletion(-)
> 
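To make the warning condition concrete, a hedged sketch of the kind of
check the patch performs on CPU offline (the function name and flow are
illustrative, not the actual diff):

  static bool last_hk_cpu_going_away(struct blk_mq_hw_ctx *hctx,
                                     unsigned int dying_cpu)
  {
          const struct cpumask *hk =
                  housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
          unsigned int cpu;

          /* Another online housekeeping CPU still serves this hctx? */
          for_each_cpu_and(cpu, hctx->cpumask, hk)
                  if (cpu != dying_cpu && cpu_online(cpu))
                          return false;

          /* None left: any online isolated CPU mapped here can
           * still submit IO and will stall. */
          for_each_cpu_andnot(cpu, hctx->cpumask, hk)
                  if (cpu_online(cpu))
                          return true;

          return false;
  }

Presumably the hctx offline path wraps such a check in a WARN_ON_ONCE().
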
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v5 9/9] doc: update managed_irq documentation
  2025-01-10 16:26 ` [PATCH v5 9/9] doc: update managed_irq documentation Daniel Wagner
@ 2025-01-13  7:14   ` Hannes Reinecke
  0 siblings, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2025-01-13  7:14 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 1/10/25 17:26, Daniel Wagner wrote:
> The managed_irq documentation is a bit difficult to understand.
> Rephrase the current text and add the latest changes on how managed_irq
> CPU sets are handled.
> 
> Document that isolated CPUs and housekeeping CPUs are grouped into
> sets, and that stalls are possible if all housekeeping CPUs in a set
> are offlined.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   Documentation/admin-guide/kernel-parameters.txt | 46 +++++++++++++------------
>   1 file changed, 24 insertions(+), 22 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


end of thread

Thread overview: 18+ messages
2025-01-10 16:26 [PATCH v5 0/9] blk: honor isolcpus configuration Daniel Wagner
2025-01-10 16:26 ` [PATCH v5 1/9] lib/group_cpus: let group_cpu_evenly return number initialized masks Daniel Wagner
2025-01-13  7:07   ` Hannes Reinecke
2025-01-10 16:26 ` [PATCH v5 2/9] blk-mq: add number of queue calc helper Daniel Wagner
2025-01-10 16:26 ` [PATCH v5 3/9] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
2025-01-10 16:26 ` [PATCH v5 4/9] scsi: " Daniel Wagner
2025-01-10 21:37   ` Martin K. Petersen
2025-01-13  7:10   ` Hannes Reinecke
2025-01-10 16:26 ` [PATCH v5 5/9] virtio: blk/scsi: " Daniel Wagner
2025-01-13  7:11   ` Hannes Reinecke
2025-01-10 16:26 ` [PATCH v5 6/9] lib/group_cpus: honor housekeeping config when grouping CPUs Daniel Wagner
2025-01-10 16:26 ` [PATCH v5 7/9] blk-mq: use hk cpus only when isolcpus=managed_irq is enabled Daniel Wagner
2025-01-13  7:13   ` Hannes Reinecke
2025-01-10 16:26 ` [PATCH v5 8/9] blk-mq: issue warning when offlining hctx with online isolcpus Daniel Wagner
2025-01-11  3:40   ` Ming Lei
2025-01-13  7:14   ` Hannes Reinecke
2025-01-10 16:26 ` [PATCH v5 9/9] doc: update managed_irq documentation Daniel Wagner
2025-01-13  7:14   ` Hannes Reinecke
