linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 00/10] blk: honor isolcpus configuration
@ 2025-07-02 16:33 Daniel Wagner
  2025-07-02 16:33 ` [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly() Daniel Wagner
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

The first part from v6 is already in linux-next.

One of the changes requested in v6 was to only use group_mask_cpus_evenly
when the caller explicitly asks for it, instead of applying it at a
global level. To support this, I decided to add a cpumask to struct
irq_affinity, which is used when creating CPU groups. During testing, I
realized it was also necessary to provide two additional block layer
helpers to return such masks: blk_mq_{possible|online}_queue_affinity.
These match the existing blk_mq_num_{possible|online}_queues.
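
For illustration, the intended pairing looks roughly like this (sketch
only; affd and max_hw_queues are placeholder driver-side variables):

        struct irq_affinity affd = { .pre_vectors = 1 };
        unsigned int nr_queues;

        /* spread queues over all possible CPUs (as nvme-pci does here) */
        nr_queues = blk_mq_num_possible_queues(max_hw_queues);
        affd.mask = blk_mq_possible_queue_affinity();

        /* or, spread queues over online CPUs only */
        nr_queues = blk_mq_num_online_queues(max_hw_queues);
        affd.mask = blk_mq_online_queue_affinity();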

As a result, this version again includes bits and pieces that touch
several subsystems. If the general approach of extending struct
irq_affinity is acceptable but there are additional change requests,
particularly in the isolcpus code, I’m happy to split the first part out
and have it reviewed separately. Let's see how the review goes first,
though. From what I understand, this is the part Aaron is interested in.

Ming requested adding HK_TYPE_MANAGED_IRQ to blk_mq_map_hk_queues, but
it's not really needed: blk_mq_map_hk_queues is only used for
HK_TYPE_IO_QUEUE, while the managed-irq case should keep using the
existing mapping code (i.e., the existing behavior is unchanged). This is
why I haven't added it.

The mapping code is now capable of generating a valid configuration
across various system setups, for example, when the number of online CPUs
is less than the number of possible CPUs, or when the number of requested
queues is fewer than the number of online CPUs. Nevertheless, I've added
validation code so that the system doesn't hang on boot when it's not
possible to create a valid configuration.

I've started testing the drivers, but I don’t have access to all of them.
So far, nvme-pci, virtio, and qla2xxx are tested. The rest are on my TODO
list. I’m sure not everything is working yet, but I think it’s time to
post an update and collect feedback to see if this is heading in the
right direction.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
Changes in v7:
- send out first part of the series:
  https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@kernel.org/
- added command line documentation
- added validation code so that the resulting mapping is operational
- rewrote mapping code for isolcpus so it takes into account active hctx
- added blk_mq_map_hk_irq_queues which uses mask from irq_get_affinity
- refactored blk_mq_map_hk_queues so caller tests for HK_TYPE_MANAGED_IRQ
- Link to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org

Changes in v6:
- added io_queue isolcpus type back
- prevent offlining a hk CPU if an isolated CPU is still present, instead of just warning
- Link to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org

Changes in v5:
- rebased on latest for-6.14/block
- updated documentation on managed_irq
- updated commit message "blk-mq: issue warning when offlining hctx with online isolcpus"
- split input/output parameter in "lib/group_cpus: let group_cpu_evenly return number of groups"
- dropped "sched/isolation: document HK_TYPE housekeeping option"
- Link to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org

Changes in v4:
- added "blk-mq: issue warning when offlining hctx with online isolcpus"
- fixed check in group_cpus_evenly: the if condition needs to use
  housekeeping_enabled() and not cpumask_weight(housekeeping_masks),
  because the latter will always return a valid mask.
- dropped fixed tag from "lib/group_cpus.c: honor housekeeping config when
  grouping CPUs"
- fixed overlong line "scsi: use block layer helpers to calculate num
  of queues"
- dropped "sched/isolation: Add io_queue housekeeping option",
  just document the housekeep enum hk_type
- added "lib/group_cpus: let group_cpu_evenly return number of groups"
- collected tags
- split the series into a preparation series:
  https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Link to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de

Changes in v3:
- lifted a couple of patches from
  https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/
  "virtio: add APIs for retrieving vq affinity"
  "blk-mq: introduce blk_mq_dev_map_queues"
- replaces all users of blk_mq_[pci|virtio]_map_queues with
  blk_mq_dev_map_queues
- updated/extended number of queue calc helpers
- add isolcpus=io_queue CPU-hctx mapping function
- documented enum hk_type and isolcpus=io_queue
- added "scsi: pm8001: do not overwrite PCI queue mapping"
- Link to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes in v2:
- updated documentation
- split the blk/nvme-pci patch
- dropped HK_TYPE_IO_QUEUE, use HK_TYPE_MANAGED_IRQ
- Link to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de

---
Daniel Wagner (10):
      lib/group_cpus: Add group_masks_cpus_evenly()
      genirq/affinity: Add cpumask to struct irq_affinity
      blk-mq: add blk_mq_{online|possible}_queue_affinity
      nvme-pci: use block layer helpers to constrain queue affinity
      scsi: Use block layer helpers to constrain queue affinity
      virtio: blk/scsi: use block layer helpers to constrain queue affinity
      isolation: Introduce io_queue isolcpus type
      blk-mq: use hk cpus only when isolcpus=io_queue is enabled
      blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
      docs: add io_queue flag to isolcpus

 Documentation/admin-guide/kernel-parameters.txt |  19 ++-
 block/blk-mq-cpumap.c                           | 218 +++++++++++++++++++++++-
 block/blk-mq.c                                  |  42 +++++
 drivers/block/virtio_blk.c                      |   4 +-
 drivers/nvme/host/pci.c                         |   1 +
 drivers/scsi/fnic/fnic_isr.c                    |   7 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c          |   1 +
 drivers/scsi/megaraid/megaraid_sas_base.c       |   5 +-
 drivers/scsi/mpi3mr/mpi3mr_fw.c                 |   6 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c             |   5 +-
 drivers/scsi/pm8001/pm8001_init.c               |   1 +
 drivers/scsi/qla2xxx/qla_isr.c                  |   1 +
 drivers/scsi/smartpqi/smartpqi_init.c           |   7 +-
 drivers/scsi/virtio_scsi.c                      |   5 +-
 include/linux/blk-mq.h                          |   2 +
 include/linux/group_cpus.h                      |   3 +
 include/linux/interrupt.h                       |  16 +-
 include/linux/sched/isolation.h                 |   1 +
 kernel/irq/affinity.c                           |  12 +-
 kernel/sched/isolation.c                        |   7 +
 lib/group_cpus.c                                |  64 ++++++-
 21 files changed, 405 insertions(+), 22 deletions(-)
---
base-commit: 32f85e8468ce081d8e73ca3f0d588f1004013037
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b

Best regards,
-- 
Daniel Wagner <wagi@kernel.org>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly()
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:18   ` Hannes Reinecke
  2025-07-11  8:28   ` John Garry
  2025-07-02 16:33 ` [PATCH v7 02/10] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

group_mask_cpus_evenly() allows the caller to pass in a CPU mask that
should be evenly distributed. This new function is a more generic
version of the existing group_cpus_evenly(), which always distributes
all present CPUs into groups.
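
A minimal usage sketch (error handling trimmed; nr_queues and the
constraint mask are up to the caller, e.g. the io_queue housekeeping
mask introduced later in this series):

        unsigned int nr_masks;
        struct cpumask *masks;

        masks = group_mask_cpus_evenly(nr_queues,
                        housekeeping_cpumask(HK_TYPE_IO_QUEUE),
                        &nr_masks);
        if (!masks)
                return -ENOMEM;

        /* masks[0] .. masks[nr_masks - 1] hold the per-group CPU sets */
        kfree(masks);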

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/group_cpus.h |  3 +++
 lib/group_cpus.c           | 64 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index 9d4e5ab6c314b31c09fda82c3f6ac18f77e9de36..d4604dce1316a08400e982039006331f34c18ee8 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -10,5 +10,8 @@
 #include <linux/cpu.h>
 
 struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks);
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *cpu_mask,
+				       unsigned int *nummasks);
 
 #endif
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 6d08ac05f371bf880571507d935d9eb501616a84..00c9b7a10c8acd29239fe20d2a30fdae22ef74a5 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -8,6 +8,7 @@
 #include <linux/cpu.h>
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 #ifdef CONFIG_SMP
 
@@ -425,6 +426,59 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	*nummasks = min(nr_present + nr_others, numgrps);
 	return masks;
 }
+EXPORT_SYMBOL_GPL(group_cpus_evenly);
+
+/**
 + * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
 + * @numgrps: number of groups
 + * @cpu_mask: CPUs to consider for the grouping
 + * @nummasks: number of initialized cpumasks
 + *
 + * Return: cpumask array if successful, NULL otherwise. Each element
 + * contains the CPUs assigned to that group.
 + *
 + * Try to put CPUs that are close in terms of CPU and NUMA locality into
 + * the same group and distribute the CPUs in @cpu_mask evenly across them.
+ */
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *cpu_mask,
+				       unsigned int *nummasks)
+{
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_nmsk;
+
+	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	build_node_to_cpumask(node_to_cpumask);
+
+	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, cpu_mask, nmsk,
+				  masks);
+
+fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	*nummasks = ret;
+	return masks;
+}
+EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
+
 #else /* CONFIG_SMP */
 struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 {
@@ -442,5 +496,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	*nummasks = 1;
 	return masks;
 }
-#endif /* CONFIG_SMP */
 EXPORT_SYMBOL_GPL(group_cpus_evenly);
+
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *cpu_mask,
+				       unsigned int *nummasks)
+{
+	return group_cpus_evenly(numgrps, nummasks);
+}
+EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
+#endif /* CONFIG_SMP */

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 02/10] genirq/affinity: Add cpumask to struct irq_affinity
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
  2025-07-02 16:33 ` [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly() Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:19   ` Hannes Reinecke
  2025-07-02 16:33 ` [PATCH v7 03/10] blk-mq: add blk_mq_{online|possible}_queue_affinity Daniel Wagner
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

Pass a cpumask to irq_create_affinity_masks as an additional constraint
to consider when creating the affinity masks. This allows the caller to
exclude specific CPUs, e.g., isolated CPUs (see the 'isolcpus' kernel
command-line parameter).
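
A caller opts in simply by setting the new field; leaving it NULL keeps
the current behavior. A sketch (nvecs is a placeholder):

        struct irq_affinity_desc *masks;
        struct irq_affinity affd = {
                .pre_vectors = 1,
                .mask        = cpu_online_mask, /* example constraint */
        };

        masks = irq_create_affinity_masks(nvecs, &affd);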

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/interrupt.h | 16 ++++++++++------
 kernel/irq/affinity.c     | 12 ++++++++++--
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 51b6484c049345c75816c4a63b4efa813f42f27b..b1a230953514da57e30e601727cd0e94796153d3 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -284,18 +284,22 @@ struct irq_affinity_notify {
  * @nr_sets:		The number of interrupt sets for which affinity
  *			spreading is required
  * @set_size:		Array holding the size of each interrupt set
+ * @mask:		cpumask that constrains which CPUs to consider when
+ *			calculating the number and size of the interrupt sets
  * @calc_sets:		Callback for calculating the number and size
  *			of interrupt sets
  * @priv:		Private data for usage by @calc_sets, usually a
  *			pointer to driver/device specific data.
  */
 struct irq_affinity {
-	unsigned int	pre_vectors;
-	unsigned int	post_vectors;
-	unsigned int	nr_sets;
-	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
-	void		(*calc_sets)(struct irq_affinity *, unsigned int nvecs);
-	void		*priv;
+	unsigned int		pre_vectors;
+	unsigned int		post_vectors;
+	unsigned int		nr_sets;
+	unsigned int		set_size[IRQ_AFFINITY_MAX_SETS];
+	const struct cpumask	*mask;
+	void			(*calc_sets)(struct irq_affinity *,
+					     unsigned int nvecs);
+	void			*priv;
 };
 
 /**
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4013e6ad2b2f1cb91de12bb428b3281105f7d23b..c68156f7847a7920103e39124676d06191304ef6 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -70,7 +70,13 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	 */
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int nr_masks, this_vecs = affd->set_size[i];
-		struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks);
+		struct cpumask *result;
+
+		if (affd->mask)
+			result = group_mask_cpus_evenly(this_vecs, affd->mask,
+							&nr_masks);
+		else
+			result = group_cpus_evenly(this_vecs, &nr_masks);
 
 		if (!result) {
 			kfree(masks);
@@ -115,7 +121,9 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
 	if (resv > minvec)
 		return 0;
 
-	if (affd->calc_sets) {
+	if (affd->mask) {
+		set_vecs = cpumask_weight(affd->mask);
+	} else if (affd->calc_sets) {
 		set_vecs = maxvec - resv;
 	} else {
 		cpus_read_lock();

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 03/10] blk-mq: add blk_mq_{online|possible}_queue_affinity
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
  2025-07-02 16:33 ` [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly() Daniel Wagner
  2025-07-02 16:33 ` [PATCH v7 02/10] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:29   ` Hannes Reinecke
  2025-07-02 16:33 ` [PATCH v7 04/10] nvme-pci: use block layer helpers to constrain queue affinity Daniel Wagner
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

Introduce blk_mq_{online|possible}_queue_affinity, which returns the
queue-to-CPU mapping constraints defined by the block layer. This allows
other subsystems (e.g., IRQ affinity setup) to respect block layer
requirements.

It is necessary to provide versions for both the online and possible CPU
masks because some drivers want to spread their I/O queues only across
online CPUs, while others prefer to use all possible CPUs. The mask
used needs to match the number of queues requested
(see blk_mq_num_{online|possible}_queues).
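
For example, a driver that spreads its queues over online CPUs only
would pair the helpers like this (a sketch; pdev, min_vecs and
max_hw_queues are placeholders):

        struct irq_affinity desc = {
                .mask = blk_mq_online_queue_affinity(),
        };
        unsigned int nr_queues = blk_mq_num_online_queues(max_hw_queues);
        int nr_vecs;

        nr_vecs = pci_alloc_irq_vectors_affinity(pdev, min_vecs, nr_queues,
                        PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc);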

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c  | 24 ++++++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 705da074ad6c7e88042296f21b739c6d686a72b6..8244ecf878358c0b8de84458dcd5100c2f360213 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -26,6 +26,30 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
 	return min_not_zero(num, max_queues);
 }
 
+/**
+ * blk_mq_possible_queue_affinity - Return block layer queue affinity
+ *
+ * Returns an affinity mask that represents the queue-to-CPU mapping
+ * requested by the block layer based on possible CPUs.
+ */
+const struct cpumask *blk_mq_possible_queue_affinity(void)
+{
+	return cpu_possible_mask;
+}
+EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
+
+/**
+ * blk_mq_online_queue_affinity - Return block layer queue affinity
+ *
+ * Returns an affinity mask that represents the queue-to-CPU mapping
+ * requested by the block layer based on online CPUs.
+ */
+const struct cpumask *blk_mq_online_queue_affinity(void)
+{
+	return cpu_online_mask;
+}
+EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
+
 /**
  * blk_mq_num_possible_queues - Calc nr of queues for multiqueue devices
  * @max_queues:	The maximum number of queues the hardware/driver
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2a5a828f19a0ba6ff0812daf40eed67f0e12ada1..1144017dce47af82f9d010e42bfbf26fa4ddf33f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -947,6 +947,8 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 void blk_mq_unfreeze_queue_non_owner(struct request_queue *q);
 void blk_freeze_queue_start_non_owner(struct request_queue *q);
 
+const struct cpumask *blk_mq_possible_queue_affinity(void);
+const struct cpumask *blk_mq_online_queue_affinity(void);
 unsigned int blk_mq_num_possible_queues(unsigned int max_queues);
 unsigned int blk_mq_num_online_queues(unsigned int max_queues);
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 04/10] nvme-pci: use block layer helpers to constrain queue affinity
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
                   ` (2 preceding siblings ...)
  2025-07-02 16:33 ` [PATCH v7 03/10] blk-mq: add blk_mq_{online|possible}_queue_affinity Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:29   ` Hannes Reinecke
  2025-07-02 16:33 ` [PATCH v7 05/10] scsi: Use " Daniel Wagner
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the NVMe driver
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a509dc1f1d1400bc0de6d2f9424c126d9b966751..5293d5a3e5ee19427bec834741258be134bdc2c9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2589,6 +2589,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 		.pre_vectors	= 1,
 		.calc_sets	= nvme_calc_irq_sets,
 		.priv		= dev,
+		.mask		= blk_mq_possible_queue_affinity(),
 	};
 	unsigned int irq_queues, poll_queues;
 	unsigned int flags = PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY;

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 05/10] scsi: Use block layer helpers to constrain queue affinity
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
                   ` (3 preceding siblings ...)
  2025-07-02 16:33 ` [PATCH v7 04/10] nvme-pci: use block layer helpers to constrain queue affinity Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:43   ` Hannes Reinecke
  2025-07-02 16:33 ` [PATCH v7 06/10] virtio: blk/scsi: use " Daniel Wagner
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the SCSI drivers
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/scsi/fnic/fnic_isr.c              | 7 +++++--
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 1 +
 drivers/scsi/megaraid/megaraid_sas_base.c | 5 ++++-
 drivers/scsi/mpi3mr/mpi3mr_fw.c           | 6 +++++-
 drivers/scsi/mpt3sas/mpt3sas_base.c       | 5 ++++-
 drivers/scsi/pm8001/pm8001_init.c         | 1 +
 drivers/scsi/qla2xxx/qla_isr.c            | 1 +
 drivers/scsi/smartpqi/smartpqi_init.c     | 7 +++++--
 8 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/fnic/fnic_isr.c b/drivers/scsi/fnic/fnic_isr.c
index 7ed50b11afa6a992c9a69dc746d271376ea8fe08..c6d8582c767edd8121a07966ed527260a28e5cb5 100644
--- a/drivers/scsi/fnic/fnic_isr.c
+++ b/drivers/scsi/fnic/fnic_isr.c
@@ -245,6 +245,9 @@ int fnic_set_intr_mode_msix(struct fnic *fnic)
 	unsigned int m = ARRAY_SIZE(fnic->wq);
 	unsigned int o = ARRAY_SIZE(fnic->hw_copy_wq);
 	unsigned int min_irqs = n + m + 1 + 1; /*rq, raw wq, wq, err*/
+	struct irq_affinity desc = {
+		.mask = blk_mq_online_queue_affinity(),
+	};
 
 	/*
 	 * We need n RQs, m WQs, o Copy WQs, n+m+o CQs, and n+m+o+1 INTRs
@@ -263,8 +266,8 @@ int fnic_set_intr_mode_msix(struct fnic *fnic)
 		int vec_count = 0;
 		int vecs = fnic->rq_count + fnic->raw_wq_count + fnic->wq_copy_count + 1;
 
-		vec_count = pci_alloc_irq_vectors(fnic->pdev, min_irqs, vecs,
-					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
+		vec_count = pci_alloc_irq_vectors_affinity(fnic->pdev, min_irqs, vecs,
+					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc);
 		FNIC_ISR_DBG(KERN_INFO, fnic->host, fnic->fnic_num,
 					"allocated %d MSI-X vectors\n",
 					vec_count);
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index bc5d5356dd00710277e4b8877798f64c9674d5de..2906dd9a6c895827e07b1ba0540f0f27ac704f47 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2607,6 +2607,7 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 	struct pci_dev *pdev = hisi_hba->pci_dev;
 	struct irq_affinity desc = {
 		.pre_vectors = BASE_VECTORS_V3_HW,
+		.mask = blk_mq_online_queue_affinity(),
 	};
 
 	min_msi = MIN_AFFINE_VECTORS_V3_HW;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 615e06fd4ee8e5d1c14ef912460962eacb450c04..c8df2dc47689a5dad02e1364de1d71e24f6159d0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5927,7 +5927,10 @@ static int
 __megasas_alloc_irq_vectors(struct megasas_instance *instance)
 {
 	int i, irq_flags;
-	struct irq_affinity desc = { .pre_vectors = instance->low_latency_index_start };
+	struct irq_affinity desc = {
+		.pre_vectors = instance->low_latency_index_start,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 	struct irq_affinity *descp = &desc;
 
 	irq_flags = PCI_IRQ_MSIX;
diff --git a/drivers/scsi/mpi3mr/mpi3mr_fw.c b/drivers/scsi/mpi3mr/mpi3mr_fw.c
index 1d7901a8f0e40658b78415704e8c81e28ef6d3df..c790d50cda36dc2c33571e29fdd7f661b85a48b1 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_fw.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_fw.c
@@ -820,7 +820,11 @@ static int mpi3mr_setup_isr(struct mpi3mr_ioc *mrioc, u8 setup_one)
 	int max_vectors, min_vec;
 	int retval;
 	int i;
-	struct irq_affinity desc = { .pre_vectors =  1, .post_vectors = 1 };
+	struct irq_affinity desc = {
+		.pre_vectors =  1,
+		.post_vectors = 1,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 
 	if (mrioc->is_intr_info_set)
 		return 0;
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index bd3efa5b46c780d43fae58c12f0bce5057dcda85..a55dd75221a6079a29f6ebee402b3654b94411c1 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3364,7 +3364,10 @@ static int
 _base_alloc_irq_vectors(struct MPT3SAS_ADAPTER *ioc)
 {
 	int i, irq_flags = PCI_IRQ_MSIX;
-	struct irq_affinity desc = { .pre_vectors = ioc->high_iops_queues };
+	struct irq_affinity desc = {
+		.pre_vectors = ioc->high_iops_queues,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 	struct irq_affinity *descp = &desc;
 	/*
 	 * Don't allocate msix vectors for poll_queues.
diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index 599410bcdfea59aba40e3dd6749434b7b5966d48..1d4807eeed75acdfe091a3c0560a926ebb59e1e8 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -977,6 +977,7 @@ static u32 pm8001_setup_msix(struct pm8001_hba_info *pm8001_ha)
 		 */
 		struct irq_affinity desc = {
 			.pre_vectors = 1,
+			.mask = blk_mq_online_queue_affinity(),
 		};
 		rc = pci_alloc_irq_vectors_affinity(
 				pm8001_ha->pdev, 2, PM8001_MAX_MSIX_VEC,
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index c4c6b5c6658c0734f7ff68bcc31b33dde87296dd..7c5adadfd731f0e395ea0050f105196ab9a503e9 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -4522,6 +4522,7 @@ qla24xx_enable_msix(struct qla_hw_data *ha, struct rsp_que *rsp)
 	int min_vecs = QLA_BASE_VECTORS;
 	struct irq_affinity desc = {
 		.pre_vectors = QLA_BASE_VECTORS,
+		.mask = blk_mq_online_queue_affinity(),
 	};
 
 	if (QLA_TGT_MODE_ENABLED() && (ql2xenablemsix != 0) &&
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 125944941601e683e9aa9d4fc6a346230bef904b..24338919120e341a54d610b6fedc29a9cc29055b 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -4109,13 +4109,16 @@ static int pqi_enable_msix_interrupts(struct pqi_ctrl_info *ctrl_info)
 {
 	int num_vectors_enabled;
 	unsigned int flags = PCI_IRQ_MSIX;
+	struct irq_affinity desc = {
+		.mask = blk_mq_online_queue_affinity(),
+	};
 
 	if (!pqi_disable_managed_interrupts)
 		flags |= PCI_IRQ_AFFINITY;
 
-	num_vectors_enabled = pci_alloc_irq_vectors(ctrl_info->pci_dev,
+	num_vectors_enabled = pci_alloc_irq_vectors_affinity(ctrl_info->pci_dev,
 			PQI_MIN_MSIX_VECTORS, ctrl_info->num_queue_groups,
-			flags);
+			flags, &desc);
 	if (num_vectors_enabled < 0) {
 		dev_err(&ctrl_info->pci_dev->dev,
 			"MSI-X init failed with error %d\n",

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 06/10] virtio: blk/scsi: use block layer helpers to constrain queue affinity
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
                   ` (4 preceding siblings ...)
  2025-07-02 16:33 ` [PATCH v7 05/10] scsi: Use " Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:43   ` Hannes Reinecke
  2025-07-02 16:33 ` [PATCH v7 07/10] isolation: Introduce io_queue isolcpus type Daniel Wagner
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the virtio drivers
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/block/virtio_blk.c | 4 +++-
 drivers/scsi/virtio_scsi.c | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e649fa67bac16b4f0c6e8e8f0e6bec111897c355..41b06540c7fb22fd1d2708338c514947c4bdeefe 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -963,7 +963,9 @@ static int init_vq(struct virtio_blk *vblk)
 	unsigned short num_vqs;
 	unsigned short num_poll_vqs;
 	struct virtio_device *vdev = vblk->vdev;
-	struct irq_affinity desc = { 0, };
+	struct irq_affinity desc = {
+		.mask = blk_mq_possible_queue_affinity(),
+	};
 
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
 				   struct virtio_blk_config, num_queues,
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 96a69edddbe5555574fc8fed1ba7c82a99df4472..67dfb265bf9e54adc68978ac8d93187e6629c330 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -842,7 +842,10 @@ static int virtscsi_init(struct virtio_device *vdev,
 	u32 num_vqs, num_poll_vqs, num_req_vqs;
 	struct virtqueue_info *vqs_info;
 	struct virtqueue **vqs;
-	struct irq_affinity desc = { .pre_vectors = 2 };
+	struct irq_affinity desc = {
+		.pre_vectors = 2,
+		.mask = blk_mq_possible_queue_affinity(),
+	};
 
 	num_req_vqs = vscsi->num_queues;
 	num_vqs = num_req_vqs + VIRTIO_SCSI_VQ_BASE;

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 07/10] isolation: Introduce io_queue isolcpus type
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
                   ` (5 preceding siblings ...)
  2025-07-02 16:33 ` [PATCH v7 06/10] virtio: blk/scsi: use " Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-08  1:26   ` Aaron Tomlin
  2025-07-02 16:33 ` [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

Multiqueue drivers spread I/O queues across all CPUs for optimal
performance. However, these drivers are not aware of CPU isolation
requirements and will distribute queues without considering the isolcpus
configuration.

Introduce a new isolcpus mask that allows users to define which CPUs
should have I/O queues assigned. This is similar to managed_irq, but
intended for drivers that do not use the managed IRQ infrastructure.
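
For example, booting with (CPU list chosen arbitrarily):

        isolcpus=io_queue,2-3,6-7,12-13

isolates CPUs 2-3, 6-7 and 12-13 from I/O queue work, so queues are
placed on the remaining housekeeping CPUs only.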

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/sched/isolation.h | 1 +
 kernel/sched/isolation.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b583b8a1c91574446382f093bccdb1..6b6ae9c5b2f61a93c649a98ea27482b932627fca 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -9,6 +9,7 @@
 enum hk_type {
 	HK_TYPE_DOMAIN,
 	HK_TYPE_MANAGED_IRQ,
+	HK_TYPE_IO_QUEUE,
 	HK_TYPE_KERNEL_NOISE,
 	HK_TYPE_MAX,
 
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 93b038d48900a304a29ecc0c8aa8b7d419ea1397..c8cb0cf2b15a11524be73826f38bb2a0709c449c 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -11,6 +11,7 @@
 enum hk_flags {
 	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
+	HK_FLAG_IO_QUEUE	= BIT(HK_TYPE_IO_QUEUE),
 	HK_FLAG_KERNEL_NOISE	= BIT(HK_TYPE_KERNEL_NOISE),
 };
 
@@ -224,6 +225,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "io_queue,", 9)) {
+			str += 9;
+			flags |= HK_FLAG_IO_QUEUE;
+			continue;
+		}
+
 		/*
 		 * Skip unknown sub-parameter and validate that it is not
 		 * containing an invalid character.

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
                   ` (6 preceding siblings ...)
  2025-07-02 16:33 ` [PATCH v7 07/10] isolation: Introduce io_queue isolcpus type Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:58   ` Hannes Reinecke
                     ` (2 more replies)
  2025-07-02 16:33 ` [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
  2025-07-02 16:34 ` [PATCH v7 10/10] docs: add io_queue flag to isolcpus Daniel Wagner
  9 siblings, 3 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

Extend the capabilities of the generic CPU to hardware queue (hctx)
mapping code so that it maps housekeeping CPUs and isolated CPUs evenly
to the hardware queues.

An hctx is only operational when at least one online housekeeping CPU is
assigned to it (aka active_hctx). Thus, validate the final mapping to
ensure there is no hctx which has only offline housekeeping CPUs and
online isolated CPUs.

Example mapping result:

  16 online CPUs

  isolcpus=io_queue,2-3,6-7,12-13

Queue mapping:
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7
        hctx4: default 8 12
        hctx5: default 9 13
        hctx6: default 10
        hctx7: default 11
        hctx8: default 14
        hctx9: default 15

IRQ mapping:
        irq 42 affinity 0 effective 0  nvme0q0
        irq 43 affinity 0 effective 0  nvme0q1
        irq 44 affinity 1 effective 1  nvme0q2
        irq 45 affinity 4 effective 4  nvme0q3
        irq 46 affinity 5 effective 5  nvme0q4
        irq 47 affinity 8 effective 8  nvme0q5
        irq 48 affinity 9 effective 9  nvme0q6
        irq 49 affinity 10 effective 10  nvme0q7
        irq 50 affinity 11 effective 11  nvme0q8
        irq 51 affinity 14 effective 14  nvme0q9
        irq 52 affinity 15 effective 15  nvme0q10

A corner case is when the number of online CPUs and present CPUs
differ and the driver asks for fewer queues than online CPUs, e.g.

  8 online CPUs, 16 possible CPUs

  isolcpus=io_queue,2-3,6-7,12-13
  virtio_blk.num_request_queues=2

Queue mapping:
        hctx0: default 0 1 2 3 4 5 6 7 8 12 13
        hctx1: default 9 10 11 14 15

IRQ mapping:
        irq 27 affinity 0 effective 0 virtio0-config
        irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
        irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c | 194 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 191 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8244ecf878358c0b8de84458dcd5100c2f360213..4cb2724a78e13216e50f0e6b1a18f19ea41a54f8 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -17,12 +17,25 @@
 #include "blk.h"
 #include "blk-mq.h"
 
+static struct cpumask blk_hk_online_mask;
+
 static unsigned int blk_mq_num_queues(const struct cpumask *mask,
 				      unsigned int max_queues)
 {
 	unsigned int num;
 
-	num = cpumask_weight(mask);
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		const struct cpumask *hk_mask;
+		struct cpumask avail_mask;
+
+		hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+		cpumask_and(&avail_mask, mask, hk_mask);
+
+		num = cpumask_weight(&avail_mask);
+	} else {
+		num = cpumask_weight(mask);
+	}
+
 	return min_not_zero(num, max_queues);
 }
 
@@ -31,9 +44,13 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
  *
  * Returns an affinity mask that represents the queue-to-CPU mapping
  * requested by the block layer based on possible CPUs.
+ * This helper takes isolcpus settings into account.
  */
 const struct cpumask *blk_mq_possible_queue_affinity(void)
 {
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
 	return cpu_possible_mask;
 }
 EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
@@ -46,6 +63,12 @@ EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
  */
 const struct cpumask *blk_mq_online_queue_affinity(void)
 {
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
+			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
+		return &blk_hk_online_mask;
+	}
+
 	return cpu_online_mask;
 }
 EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
@@ -57,7 +80,8 @@ EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of possible CPUs.
+ * device based on the number of possible CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
 {
@@ -72,7 +96,8 @@ EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of online CPUs.
+ * device based on the number of online CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 {
@@ -80,11 +105,169 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
 
+static bool blk_mq_hk_validate(struct blk_mq_queue_map *qmap,
+			       const struct cpumask *active_hctx)
+{
+	/*
+	 * Verify if the mapping is usable.
+	 *
 +	 * First, mark all hctx which have at least one online
 +	 * housekeeping CPU assigned.
+	 */
+	for (int queue = 0; queue < qmap->nr_queues; queue++) {
+		int cpu;
+
+		if (cpumask_test_cpu(queue, active_hctx)) {
+			/*
 +			 * This hctx has at least one online housekeeping
 +			 * CPU, thus it is able to serve any assigned
 +			 * isolated CPU.
+			 */
+			continue;
+		}
+
+		/*
 +		 * There is no online housekeeping CPU for this hctx, all
+		 * good as long as all isolated CPUs are also offline.
+		 */
+		for_each_online_cpu(cpu) {
+			if (qmap->mq_map[cpu] != queue)
+				continue;
+
+			pr_warn("Unable to create a usable CPU-to-queue mapping with the given constraints\n");
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * blk_mq_map_hk_queues - Create housekeeping CPU to
+ *                        hardware queue mapping
+ * @qmap:	CPU to hardware queue map
+ *
+ * Create a housekeeping CPU to hardware queue mapping in @qmap. @qmap
+ * contains a valid configuration honoring the isolcpus configuration.
+ */
+static void blk_mq_map_hk_queues(struct blk_mq_queue_map *qmap)
+{
+	cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
+	struct cpumask *hk_masks __free(kfree) = NULL;
+	const struct cpumask *mask;
+	unsigned int queue, cpu, nr_masks;
+
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	else
+		goto fallback;
+
+	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
+		goto fallback;
+
+	/* Map housekeeping CPUs to a hctx */
+	hk_masks = group_mask_cpus_evenly(qmap->nr_queues, mask, &nr_masks);
+	if (!hk_masks)
+		goto fallback;
+
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
+		unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
+
+		for_each_cpu(cpu, &hk_masks[idx]) {
+			qmap->mq_map[cpu] = idx;
+
+			if (cpu_online(cpu))
+				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
+		}
+	}
+
+	/* Map isolcpus to hardware context */
+	queue = cpumask_first(active_hctx);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
+		qmap->mq_map[cpu] = (qmap->queue_offset + queue) % nr_masks;
+		queue = cpumask_next_wrap(queue, active_hctx);
+	}
+
+	if (!blk_mq_hk_validate(qmap, active_hctx))
+		goto fallback;
+
+	return;
+
+fallback:
+	/*
+	 * Map all CPUs to the first hctx to ensure at least one online
+	 * housekeeping CPU is serving it.
+	 */
+	for_each_possible_cpu(cpu)
+		qmap->mq_map[cpu] = 0;
+}
+
+/*
+ * blk_mq_map_hk_irq_queues - Create housekeeping CPU to
+ *                            hardware queue mapping
+ * @dev:	The device to map queues
+ * @qmap:	CPU to hardware queue map
+ * @offset:	Queue offset to use for the device
+ *
+ * Create a housekeeping CPU to hardware queue mapping in @qmap. @qmap
+ * contains a valid configuration honoring the isolcpus configuration.
+ */
+static void blk_mq_map_hk_irq_queues(struct device *dev,
+				     struct blk_mq_queue_map *qmap,
+				     int offset)
+{
+	cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
+	cpumask_var_t mask __free(free_cpumask_var) = NULL;
+	unsigned int queue, cpu;
+
+	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
+		goto fallback;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		goto fallback;
+
+	/* Map housekeeping CPUs to a hctx */
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
+		for_each_cpu(cpu, dev->bus->irq_get_affinity(dev, offset + queue)) {
+			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+
+			cpumask_set_cpu(cpu, mask);
+			if (cpu_online(cpu))
+				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
+		}
+	}
+
+	/* Map isolcpus to hardware context */
+	queue = cpumask_first(active_hctx);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = cpumask_next_wrap(queue, active_hctx);
+	}
+
+	if (!blk_mq_hk_validate(qmap, active_hctx))
+		goto fallback;
+
+	return;
+
+fallback:
+	/*
+	 * Map all CPUs to the first hctx to ensure at least one online
+	 * housekeeping CPU is serving it.
+	 */
+	for_each_possible_cpu(cpu)
+		qmap->mq_map[cpu] = 0;
+}
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
 	unsigned int queue, cpu, nr_masks;
 
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		blk_mq_map_hk_queues(qmap);
+		return;
+	}
+
 	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
 	if (!masks) {
 		for_each_possible_cpu(cpu)
@@ -139,6 +322,11 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 	if (!dev->bus->irq_get_affinity)
 		goto fallback;
 
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		blk_mq_map_hk_irq_queues(dev, qmap, offset);
+		return;
+	}
+
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
 		mask = dev->bus->irq_get_affinity(dev, queue + offset);
 		if (!mask)

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
                   ` (7 preceding siblings ...)
  2025-07-02 16:33 ` [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
@ 2025-07-02 16:33 ` Daniel Wagner
  2025-07-03  6:58   ` Hannes Reinecke
                     ` (2 more replies)
  2025-07-02 16:34 ` [PATCH v7 10/10] docs: add io_queue flag to isolcpus Daniel Wagner
  9 siblings, 3 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

When isolcpus=io_queue is enabled and the last housekeeping CPU for a
given hctx goes offline, no CPU would be left to handle I/O. To prevent
I/O stalls, disallow offlining housekeeping CPUs that are still serving
isolated CPUs.
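
With this in place, attempting to offline the last online housekeeping
CPU that serves an hctx with an online isolated CPU mapped to it is
rejected, roughly like this (device, hctx and CPU numbers illustrative):

        # echo 0 > /sys/devices/system/cpu/cpu0/online
        -bash: echo: write error: Invalid argument
        # dmesg | tail -1
        nvme0n1: trying to offline hctx0 but there is still an online isolcpu CPU 2 mapped to it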

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0c61492724d228736f975f1d8f195515603801b6..87240644f73ed0490a5459e042c68e0e168f727d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3681,6 +3681,43 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	return data.has_rq;
 }
 
+static bool blk_mq_hctx_can_offline_hk_cpu(struct blk_mq_hw_ctx *hctx,
+					   unsigned int this_cpu)
+{
+	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
+	for (int i = 0; i < hctx->nr_ctx; i++) {
+		struct blk_mq_ctx *ctx = hctx->ctxs[i];
+
+		if (ctx->cpu == this_cpu)
+			continue;
+
+		/*
+		 * Check if this context has at least one online
+		 * housekeeping CPU; in this case the hardware context is
+		 * usable.
+		 */
+		if (cpumask_test_cpu(ctx->cpu, hk_mask) &&
+		    cpu_online(ctx->cpu))
+			break;
+
+		/*
+		 * The context doesn't have any online housekeeping CPUs,
+		 * but there might be an online isolated CPU mapped to
+		 * it.
+		 */
+		if (cpu_is_offline(ctx->cpu))
+			continue;
+
+		pr_warn("%s: trying to offline hctx%d but there is still an online isolcpu CPU %d mapped to it\n",
+			hctx->queue->disk->disk_name,
+			hctx->queue_num, ctx->cpu);
+		return false;
+	}
+
+	return true;
+}
+
 static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int this_cpu)
 {
@@ -3712,6 +3749,11 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
 
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		if (!blk_mq_hctx_can_offline_hk_cpu(hctx, cpu))
+			return -EINVAL;
+	}
+
 	if (blk_mq_hctx_has_online_cpu(hctx, cpu))
 		return 0;
 

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 10/10] docs: add io_queue flag to isolcpus
  2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
                   ` (8 preceding siblings ...)
  2025-07-02 16:33 ` [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
@ 2025-07-02 16:34 ` Daniel Wagner
  2025-07-03  6:59   ` Hannes Reinecke
  2025-07-08  1:26   ` Aaron Tomlin
  9 siblings, 2 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-07-02 16:34 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

The io_queue flag informs multiqueue device drivers where to place
hardware queues. Document this new flag in the isolcpus
command-line argument description.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f1f2c0874da9ddfc95058c464fdf5dabaf0de713..7594ac5448575cc895ebf7af0fe051d42dc5e0e9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2590,7 +2590,6 @@
 			  "number of CPUs in system - 1".
 
 			managed_irq
-
 			  Isolate from being targeted by managed interrupts
 			  which have an interrupt mask containing isolated
 			  CPUs. The affinity of managed interrupts is
@@ -2613,6 +2612,24 @@
 			  housekeeping CPUs has no influence on those
 			  queues.
 
+			io_queue
+			  Isolate from I/O queue work caused by multiqueue
+			  device drivers. Restrict the placement of
+			  queues to housekeeping CPUs only, ensuring that
+			  all I/O work is processed by a housekeeping CPU.
+
+			  Housekeeping CPUs that serve isolated CPUs
+			  cannot be offlined.
+
+			  The io_queue configuration takes precedence over
+			  managed_irq; thus, when io_queue is used,
+			  managed_irq has no effect.
+
+			  Note: When an isolated CPU issues an I/O request,
+			  it is forwarded to a housekeeping CPU. This will
+			  trigger a software interrupt on the completion
+			  path.
+
 			The format of <cpu-list> is described above.
 
 	iucv=		[HW,NET]

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly()
  2025-07-02 16:33 ` [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly() Daniel Wagner
@ 2025-07-03  6:18   ` Hannes Reinecke
  2025-09-03 12:36     ` Daniel Wagner
  2025-07-11  8:28   ` John Garry
  1 sibling, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:18 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> group_mask_cpus_evenly() allows the caller to pass in a CPU mask that
> should be evenly distributed. This new function is a more generic
> version of the existing group_cpus_evenly(), which always distributes
> all present CPUs into groups.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   include/linux/group_cpus.h |  3 +++
>   lib/group_cpus.c           | 64 +++++++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 66 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
> index 9d4e5ab6c314b31c09fda82c3f6ac18f77e9de36..d4604dce1316a08400e982039006331f34c18ee8 100644
> --- a/include/linux/group_cpus.h
> +++ b/include/linux/group_cpus.h
> @@ -10,5 +10,8 @@
>   #include <linux/cpu.h>
>   
>   struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks);
> +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
> +				       const struct cpumask *cpu_mask,
> +				       unsigned int *nummasks);
>   
>   #endif
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index 6d08ac05f371bf880571507d935d9eb501616a84..00c9b7a10c8acd29239fe20d2a30fdae22ef74a5 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -8,6 +8,7 @@
>   #include <linux/cpu.h>
>   #include <linux/sort.h>
>   #include <linux/group_cpus.h>
> +#include <linux/sched/isolation.h>
>   
>   #ifdef CONFIG_SMP
>   
> @@ -425,6 +426,59 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
>   	*nummasks = min(nr_present + nr_others, numgrps);
>   	return masks;
>   }
> +EXPORT_SYMBOL_GPL(group_cpus_evenly);
> +
> +/**
> + * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
> + * @numgrps: number of groups
> + * @cpu_mask: CPU to consider for the grouping
> + * @nummasks: number of initialized cpusmasks
> + *
> + * Return: cpumask array if successful, NULL otherwise. And each element
> + * includes CPUs assigned to this group.
> + *
> + * Try to put close CPUs from viewpoint of CPU and NUMA locality into
> + * same group. Allocate present CPUs on these groups evenly.
> + */

Description could be improved. Point is that you do not do any
calculation here, you just call __group_cpus_evenly() with
a different mask.

> +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
> +				       const struct cpumask *cpu_mask,
> +				       unsigned int *nummasks)
> +{
> +	cpumask_var_t *node_to_cpumask;
> +	cpumask_var_t nmsk;
> +	int ret = -ENOMEM;
> +	struct cpumask *masks = NULL;
> +
> +	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
> +		return NULL;
> +
> +	node_to_cpumask = alloc_node_to_cpumask();
> +	if (!node_to_cpumask)
> +		goto fail_nmsk;
> +
> +	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
> +	if (!masks)
> +		goto fail_node_to_cpumask;
> +
> +	build_node_to_cpumask(node_to_cpumask);
> +
> +	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, cpu_mask, nmsk,
> +				  masks);
> +
> +fail_node_to_cpumask:
> +	free_node_to_cpumask(node_to_cpumask);
> +
> +fail_nmsk:
> +	free_cpumask_var(nmsk);
> +	if (ret < 0) {
> +		kfree(masks);
> +		return NULL;
> +	}
> +	*nummasks = ret;
> +	return masks;
> +}
> +EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
> +
>   #else /* CONFIG_SMP */
>   struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
>   {
> @@ -442,5 +496,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
>   	*nummasks = 1;
>   	return masks;
>   }
> -#endif /* CONFIG_SMP */
>   EXPORT_SYMBOL_GPL(group_cpus_evenly);
> +
> +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
> +				       const struct cpumask *cpu_mask,
> +				       unsigned int *nummasks)
> +{
> +	return group_cpus_evenly(numgrps, nummasks);
> +}
> +EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
> +#endif /* CONFIG_SMP */
> 

Otherwise:
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 02/10] genirq/affinity: Add cpumask to struct irq_affinity
  2025-07-02 16:33 ` [PATCH v7 02/10] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
@ 2025-07-03  6:19   ` Hannes Reinecke
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:19 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> Pass a cpumask to irq_create_affinity_masks as an additional constraint
> to consider when creating the affinity masks. This allows the caller to
> exclude specific CPUs, e.g., isolated CPUs (see the 'isolcpus' kernel
> command-line parameter).
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   include/linux/interrupt.h | 16 ++++++++++------
>   kernel/irq/affinity.c     | 12 ++++++++++--
>   2 files changed, 20 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 03/10] blk-mq: add blk_mq_{online|possible}_queue_affinity
  2025-07-02 16:33 ` [PATCH v7 03/10] blk-mq: add blk_mq_{online|possible}_queue_affinity Daniel Wagner
@ 2025-07-03  6:29   ` Hannes Reinecke
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:29 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> Introduce blk_mq_{online|possible}_queue_affinity, which returns the
> queue-to-CPU mapping constraints defined by the block layer. This allows
> other subsystems (e.g., IRQ affinity setup) to respect block layer
> requirements.
> 
> It is necessary to provide versions for both the online and possible CPU
> masks because some drivers want to spread their I/O queues only across
> online CPUs, while others prefer to use all possible CPUs. And the mask
> used needs to match with the number of queues requested
> (see blk_num_{online|possible}_queues).
> 
Technically you are correct.
However, I do have the sneaking suspicion that most drivers just use
num_online_cpus() for convenience and to reduce the size of the
structures.
But I guess I'll comment on that in the patch modifying the drivers.

> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq-cpumap.c  | 24 ++++++++++++++++++++++++
>   include/linux/blk-mq.h |  2 ++
>   2 files changed, 26 insertions(+)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 705da074ad6c7e88042296f21b739c6d686a72b6..8244ecf878358c0b8de84458dcd5100c2f360213 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -26,6 +26,30 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
>   	return min_not_zero(num, max_queues);
>   }
>   
> +/**
> + * blk_mq_possible_queue_affinity - Return block layer queue affinity
> + *
> + * Returns an affinity mask that represents the queue-to-CPU mapping
> + * requested by the block layer based on possible CPUs.
> + */
> +const struct cpumask *blk_mq_possible_queue_affinity(void)
> +{
> +	return cpu_possible_mask;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
> +
> +/**
> + * blk_mq_online_queue_affinity - Return block layer queue affinity
> + *
> + * Returns an affinity mask that represents the queue-to-CPU mapping
> + * requested by the block layer based on online CPUs.
> + */
> +const struct cpumask *blk_mq_online_queue_affinity(void)
> +{
> +	return cpu_online_mask;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
> +
>   /**
>    * blk_mq_num_possible_queues - Calc nr of queues for multiqueue devices
>    * @max_queues:	The maximum number of queues the hardware/driver
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2a5a828f19a0ba6ff0812daf40eed67f0e12ada1..1144017dce47af82f9d010e42bfbf26fa4ddf33f 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -947,6 +947,8 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>   void blk_mq_unfreeze_queue_non_owner(struct request_queue *q);
>   void blk_freeze_queue_start_non_owner(struct request_queue *q);
>   
> +const struct cpumask *blk_mq_possible_queue_affinity(void);
> +const struct cpumask *blk_mq_online_queue_affinity(void);
>   unsigned int blk_mq_num_possible_queues(unsigned int max_queues);
>   unsigned int blk_mq_num_online_queues(unsigned int max_queues);
>   void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
> 

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 04/10] nvme-pci: use block layer helpers to constrain queue affinity
  2025-07-02 16:33 ` [PATCH v7 04/10] nvme-pci: use block layer helpers to constrain queue affinity Daniel Wagner
@ 2025-07-03  6:29   ` Hannes Reinecke
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:29 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
> constraints provided by the block layer. This allows the NVMe driver
> to avoid assigning interrupts to CPUs that the block layer has excluded
> (e.g., isolated CPUs).
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/nvme/host/pci.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a509dc1f1d1400bc0de6d2f9424c126d9b966751..5293d5a3e5ee19427bec834741258be134bdc2c9 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2589,6 +2589,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
>   		.pre_vectors	= 1,
>   		.calc_sets	= nvme_calc_irq_sets,
>   		.priv		= dev,
> +		.mask		= blk_mq_possible_queue_affinity(),
>   	};
>   	unsigned int irq_queues, poll_queues;
>   	unsigned int flags = PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY;
> 

That was easy :-)

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/10] scsi: Use block layer helpers to constrain queue affinity
  2025-07-02 16:33 ` [PATCH v7 05/10] scsi: Use " Daniel Wagner
@ 2025-07-03  6:43   ` Hannes Reinecke
  2025-07-04  9:37     ` Daniel Wagner
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:43 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
> constraints provided by the block layer. This allows the SCSI drivers
> to avoid assigning interrupts to CPUs that the block layer has excluded
> (e.g., isolated CPUs).
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/scsi/fnic/fnic_isr.c              | 7 +++++--
>   drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 1 +
>   drivers/scsi/megaraid/megaraid_sas_base.c | 5 ++++-
>   drivers/scsi/mpi3mr/mpi3mr_fw.c           | 6 +++++-
>   drivers/scsi/mpt3sas/mpt3sas_base.c       | 5 ++++-
>   drivers/scsi/pm8001/pm8001_init.c         | 1 +
>   drivers/scsi/qla2xxx/qla_isr.c            | 1 +
>   drivers/scsi/smartpqi/smartpqi_init.c     | 7 +++++--
>   8 files changed, 26 insertions(+), 7 deletions(-)
 >

All of these drivers are not aware of CPU hotplug, and as such
will not be notified when the number of CPUs changes.
But you use 'blk_mq_online_queue_affinity()' for all of these
drivers.
Wouldn't 'blk_mq_possible_queue_affinity()' be a better choice here
to insulate against CPU hotplug effects?

Also some drivers which are using irq affinity (eg aacraid, lpfc) are
missing from these conversions. Why?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 06/10] virtio: blk/scsi: use block layer helpers to constrain queue affinity
  2025-07-02 16:33 ` [PATCH v7 06/10] virtio: blk/scsi: use " Daniel Wagner
@ 2025-07-03  6:43   ` Hannes Reinecke
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:43 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
> constraints provided by the block layer. This allows the virtio drivers
> to avoid assigning interrupts to CPUs that the block layer has excluded
> (e.g., isolated CPUs).
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/block/virtio_blk.c | 4 +++-
>   drivers/scsi/virtio_scsi.c | 5 ++++-
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index e649fa67bac16b4f0c6e8e8f0e6bec111897c355..41b06540c7fb22fd1d2708338c514947c4bdeefe 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -963,7 +963,9 @@ static int init_vq(struct virtio_blk *vblk)
>   	unsigned short num_vqs;
>   	unsigned short num_poll_vqs;
>   	struct virtio_device *vdev = vblk->vdev;
> -	struct irq_affinity desc = { 0, };
> +	struct irq_affinity desc = {
> +		.mask = blk_mq_possible_queue_affinity(),
> +	};
>   
>   	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
>   				   struct virtio_blk_config, num_queues,
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 96a69edddbe5555574fc8fed1ba7c82a99df4472..67dfb265bf9e54adc68978ac8d93187e6629c330 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -842,7 +842,10 @@ static int virtscsi_init(struct virtio_device *vdev,
>   	u32 num_vqs, num_poll_vqs, num_req_vqs;
>   	struct virtqueue_info *vqs_info;
>   	struct virtqueue **vqs;
> -	struct irq_affinity desc = { .pre_vectors = 2 };
> +	struct irq_affinity desc = {
> +		.pre_vectors = 2,
> +		.mask = blk_mq_possible_queue_affinity(),
> +	};
>   
>   	num_req_vqs = vscsi->num_queues;
>   	num_vqs = num_req_vqs + VIRTIO_SCSI_VQ_BASE;
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-07-02 16:33 ` [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
@ 2025-07-03  6:58   ` Hannes Reinecke
  2025-07-04  9:21     ` Daniel Wagner
  2025-07-03  9:01   ` Christoph Hellwig
  2025-07-03 14:47   ` kernel test robot
  2 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:58 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> Extend the capabilities of the generic CPU to hardware queue (hctx)
> mapping code, so it maps housekeeping CPUs and isolated CPUs to the
> hardware queues evenly.
> 
> A hctx is only operational when there is at least one online
> housekeeping CPU assigned (aka active_hctx). Thus, validate the final
> mapping to ensure there is no hctx which has only offline housekeeping
> CPUs and online isolated CPUs.
> 
> Example mapping result:
> 
>    16 online CPUs
> 
>    isolcpus=io_queue,2-3,6-7,12-13
> 
> Queue mapping:
>          hctx0: default 0 2
>          hctx1: default 1 3
>          hctx2: default 4 6
>          hctx3: default 5 7
>          hctx4: default 8 12
>          hctx5: default 9 13
>          hctx6: default 10
>          hctx7: default 11
>          hctx8: default 14
>          hctx9: default 15
> 
> IRQ mapping:
>          irq 42 affinity 0 effective 0  nvme0q0
>          irq 43 affinity 0 effective 0  nvme0q1
>          irq 44 affinity 1 effective 1  nvme0q2
>          irq 45 affinity 4 effective 4  nvme0q3
>          irq 46 affinity 5 effective 5  nvme0q4
>          irq 47 affinity 8 effective 8  nvme0q5
>          irq 48 affinity 9 effective 9  nvme0q6
>          irq 49 affinity 10 effective 10  nvme0q7
>          irq 50 affinity 11 effective 11  nvme0q8
>          irq 51 affinity 14 effective 14  nvme0q9
>          irq 52 affinity 15 effective 15  nvme0q10
> 
> A corner case is when the number of online CPUs and present CPUs
> differ and the driver asks for fewer queues than online CPUs, e.g.
> 
>    8 online CPUs, 16 possible CPUs
> 
>    isolcpus=io_queue,2-3,6-7,12-13
>    virtio_blk.num_request_queues=2
> 
> Queue mapping:
>          hctx0: default 0 1 2 3 4 5 6 7 8 12 13
>          hctx1: default 9 10 11 14 15
> 
> IRQ mapping
>          irq 27 affinity 0 effective 0 virtio0-config
>          irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
>          irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq-cpumap.c | 194 +++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 191 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 8244ecf878358c0b8de84458dcd5100c2f360213..4cb2724a78e13216e50f0e6b1a18f19ea41a54f8 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -17,12 +17,25 @@
>   #include "blk.h"
>   #include "blk-mq.h"
>   
> +static struct cpumask blk_hk_online_mask;
> +
>   static unsigned int blk_mq_num_queues(const struct cpumask *mask,
>   				      unsigned int max_queues)
>   {
>   	unsigned int num;
>   
> -	num = cpumask_weight(mask);
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		const struct cpumask *hk_mask;
> +		struct cpumask avail_mask;
> +
> +		hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +		cpumask_and(&avail_mask, mask, hk_mask);
> +
> +		num = cpumask_weight(&avail_mask);
> +	} else {
> +		num = cpumask_weight(mask);
> +	}
> +
>   	return min_not_zero(num, max_queues);
>   }
>   
> @@ -31,9 +44,13 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
>    *
>    * Returns an affinity mask that represents the queue-to-CPU mapping
>    * requested by the block layer based on possible CPUs.
> + * This helper takes isolcpus settings into account.
>    */
>   const struct cpumask *blk_mq_possible_queue_affinity(void)
>   {
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
> +		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +
>   	return cpu_possible_mask;
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
> @@ -46,6 +63,12 @@ EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
>    */
>   const struct cpumask *blk_mq_online_queue_affinity(void)
>   {
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
> +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> +		return &blk_hk_online_mask;
> +	}
> +
>   	return cpu_online_mask;
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
> @@ -57,7 +80,8 @@ EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
>    *		ignored.
>    *
>    * Calculates the number of queues to be used for a multiqueue
> - * device based on the number of possible CPUs.
> + * device based on the number of possible CPUs. This helper
> + * takes isolcpus settings into account.
>    */
>   unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
>   {
> @@ -72,7 +96,8 @@ EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
>    *		ignored.
>    *
>    * Calculates the number of queues to be used for a multiqueue
> - * device based on the number of online CPUs.
> + * device based on the number of online CPUs. This helper
> + * takes isolcpus settings into account.
>    */
>   unsigned int blk_mq_num_online_queues(unsigned int max_queues)
>   {
> @@ -80,11 +105,169 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
>   
> +static bool blk_mq_hk_validate(struct blk_mq_queue_map *qmap,
> +			       const struct cpumask *active_hctx)
> +{
> +	/*
> +	 * Verify if the mapping is usable.
> +	 *
> +	 * First, mark all hctx which have at least online houskeeping
> +	 * CPU assigned.
> +	 */
> +	for (int queue = 0; queue < qmap->nr_queues; queue++) {
> +		int cpu;
> +
> +		if (cpumask_test_cpu(queue, active_hctx)) {
> +			/*
> +			 * This htcx has at least one online houskeeping
> +			 * CPU thus it is able to serve any assigned
> +			 * isolated CPU.
> +			 */
> +			continue;
> +		}
> +
> +		/*
> +		 * There is no online houskeeping CPU for this hctx, all
> +		 * good as long as all isolated CPUs are also offline.
> +		 */
> +		for_each_online_cpu(cpu) {
> +			if (qmap->mq_map[cpu] != queue)
> +				continue;
> +
> +			pr_warn("Unable to create a usable CPU-to-queue mapping with the given constraints\n");
> +			return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +/*
> + * blk_mq_map_hk_queues - Create housekeeping CPU to
> + *                        hardware queue mapping
> + * @qmap:	CPU to hardware queue map
> + *
> + * Create a housekeeping CPU to hardware queue mapping in @qmap. @qmap
> + * contains a valid configuration honoring the isolcpus configuration.
> + */
> +static void blk_mq_map_hk_queues(struct blk_mq_queue_map *qmap)
> +{
> +	cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
> +	struct cpumask *hk_masks __free(kfree) = NULL;
> +	const struct cpumask *mask;
> +	unsigned int queue, cpu, nr_masks;
> +
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
> +		mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +	else
> +		goto fallback;
> +
> +	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
> +		goto fallback;
> +
> +	/* Map housekeeping CPUs to a hctx */
> +	hk_masks = group_mask_cpus_evenly(qmap->nr_queues, mask, &nr_masks);
> +	if (!hk_masks)
> +		goto fallback;
> +
> +	for (queue = 0; queue < qmap->nr_queues; queue++) {
> +		unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
> +
> +		for_each_cpu(cpu, &hk_masks[idx]) {
> +			qmap->mq_map[cpu] = idx;
> +
> +			if (cpu_online(cpu))
> +				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);

Why cpu_online? Up until this point it really didn't matter if the 
affinity mask was set to 'online' or 'possible' cpus, but here you
require CPUs to be online...

> +		}
> +	}
> +
> +	/* Map isolcpus to hardware context */
> +	queue = cpumask_first(active_hctx);
> +	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
> +		qmap->mq_map[cpu] = (qmap->queue_offset + queue) % nr_masks;
> +		queue = cpumask_next_wrap(queue, active_hctx);
> +	}

Really? Doesn't this map _all_ cpus, and not just the isolcpus?

> +
> +	if (!blk_mq_hk_validate(qmap, active_hctx))
> +		goto fallback;
> +
> +	return;
> +
> +fallback:
> +	/*
> +	 * Map all CPUs to the first hctx to ensure at least one online
> +	 * housekeeping CPU is serving it.
> +	 */
> +	for_each_possible_cpu(cpu)
> +		qmap->mq_map[cpu] = 0;

I think you need to map all hctx, no?

> +}
> +
> +/*
> + * blk_mq_map_hk_irq_queues - Create housekeeping CPU to
> + *                            hardware queue mapping
> + * @dev:	The device to map queues
> + * @qmap:	CPU to hardware queue map
> + * @offset:	Queue offset to use for the device
> + *
> + * Create a housekeeping CPU to hardware queue mapping in @qmap. @qmap
> + * contains a valid configuration honoring the isolcpus configuration.
> + */
> +static void blk_mq_map_hk_irq_queues(struct device *dev,
> +				     struct blk_mq_queue_map *qmap,
> +				     int offset)
> +{
> +	cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
> +	cpumask_var_t mask __free(free_cpumask_var) = NULL;
> +	unsigned int queue, cpu;
> +
> +	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
> +		goto fallback;
> +
> +	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> +		goto fallback;
> +
> +	/* Map housekeeping CPUs to a hctx */
> +	for (queue = 0; queue < qmap->nr_queues; queue++) {
> +		for_each_cpu(cpu, dev->bus->irq_get_affinity(dev, offset + queue)) {
> +			qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +
> +			cpumask_set_cpu(cpu, mask);
> +			if (cpu_online(cpu))
> +				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);

Now that is really curious. You pick up the interrupt affinity from the
'bus', which, I assume, is the PCI bus. And this would imply that the
bus can be (or already is) programmed for this interrupt affinity.
Which would imply that this is a usable interrupt affinity from the
hardware perspective, irrespective of whether the cpu is online or not.
So why the check to cpu_online()? Can't we simply take the existing 
affinity and rely on the hardware to do the right thing?

> +		}
> +	}
> +
> +	/* Map isolcpus to hardware context */
> +	queue = cpumask_first(active_hctx);
> +	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
> +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +		queue = cpumask_next_wrap(queue, active_hctx);
> +	}
> +
> +	if (!blk_mq_hk_validate(qmap, active_hctx))
> +		goto fallback;
> +
> +	return;
> +
> +fallback:
> +	/*
> +	 * Map all CPUs to the first hctx to ensure at least one online
> +	 * housekeeping CPU is serving it.
> +	 */
> +	for_each_possible_cpu(cpu)
> +		qmap->mq_map[cpu] = 0;

Same comment as previously; don't we need to map all hctx?

> +}
> +
>   void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
>   {
>   	const struct cpumask *masks;
>   	unsigned int queue, cpu, nr_masks;
>   
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		blk_mq_map_hk_queues(qmap);
> +		return;
> +	}
> +
>   	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
>   	if (!masks) {
>   		for_each_possible_cpu(cpu)
> @@ -139,6 +322,11 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
>   	if (!dev->bus->irq_get_affinity)
>   		goto fallback;
>   
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		blk_mq_map_hk_irq_queues(dev, qmap, offset);
> +		return;
> +	}
> +
>   	for (queue = 0; queue < qmap->nr_queues; queue++) {
>   		mask = dev->bus->irq_get_affinity(dev, queue + offset);
>   		if (!mask)
> 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2025-07-02 16:33 ` [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
@ 2025-07-03  6:58   ` Hannes Reinecke
  2025-07-07  7:44   ` Ming Lei
  2025-07-08  1:23   ` Aaron Tomlin
  2 siblings, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:58 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:33, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled, and the last housekeeping CPU for a
> given hctx goes offline, there would be no CPU left to handle I/O. To
> prevent I/O stalls, prevent offlining housekeeping CPUs that are still
> serving isolated CPUs.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 42 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 10/10] docs: add io_queue flag to isolcpus
  2025-07-02 16:34 ` [PATCH v7 10/10] docs: add io_queue flag to isolcpus Daniel Wagner
@ 2025-07-03  6:59   ` Hannes Reinecke
  2025-07-08  1:26   ` Aaron Tomlin
  1 sibling, 0 replies; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-03  6:59 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 7/2/25 18:34, Daniel Wagner wrote:
> The io_queue flag informs multiqueue device drivers where to place
> hardware queues. Document this new flag in the isolcpus
> command-line argument description.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   Documentation/admin-guide/kernel-parameters.txt | 19 ++++++++++++++++++-
>   1 file changed, 18 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-07-02 16:33 ` [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
  2025-07-03  6:58   ` Hannes Reinecke
@ 2025-07-03  9:01   ` Christoph Hellwig
  2025-07-04  9:00     ` Daniel Wagner
  2025-07-03 14:47   ` kernel test robot
  2 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2025-07-03  9:01 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Wed, Jul 02, 2025 at 06:33:58PM +0200, Daniel Wagner wrote:
>  const struct cpumask *blk_mq_possible_queue_affinity(void)
>  {
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
> +		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +
>  	return cpu_possible_mask;
>  }

I'm no expert on the housekeeping stuff, but why isn't the
housekeeping_enabled check done in housekeeping_cpumask directly so
that the drivers could use housekeeping_cpumask without a blk-mq
wrapper?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-07-02 16:33 ` [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
  2025-07-03  6:58   ` Hannes Reinecke
  2025-07-03  9:01   ` Christoph Hellwig
@ 2025-07-03 14:47   ` kernel test robot
  2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2025-07-03 14:47 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: llvm, oe-kbuild-all, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Hi Daniel,

kernel test robot noticed the following build errors:

[auto build test ERROR on 32f85e8468ce081d8e73ca3f0d588f1004013037]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Wagner/lib-group_cpus-Add-group_masks_cpus_evenly/20250703-003811
base:   32f85e8468ce081d8e73ca3f0d588f1004013037
patch link:    https://lore.kernel.org/r/20250702-isolcpus-io-queues-v7-8-557aa7eacce4%40kernel.org
patch subject: [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20250703/202507032238.AoTmQnGP-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f1a4bb62452d88a0edd9340b3ca7c9b11ad9193f)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250703/202507032238.AoTmQnGP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507032238.AoTmQnGP-lkp@intel.com/

All errors (new ones prefixed by >>):

>> block/blk-mq-cpumap.c:155:16: error: array initializer must be an initializer list
     155 |         cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
         |                       ^
   block/blk-mq-cpumap.c:219:16: error: array initializer must be an initializer list
     219 |         cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
         |                       ^
   block/blk-mq-cpumap.c:220:16: error: array initializer must be an initializer list
     220 |         cpumask_var_t mask __free(free_cpumask_var) = NULL;
         |                       ^
   3 errors generated.


vim +155 block/blk-mq-cpumap.c

   144	
   145	/*
   146	 * blk_mq_map_hk_queues - Create housekeeping CPU to
   147	 *                        hardware queue mapping
   148	 * @qmap:	CPU to hardware queue map
   149	 *
   150	 * Create a housekeeping CPU to hardware queue mapping in @qmap. @qmap
   151	 * contains a valid configuration honoring the isolcpus configuration.
   152	 */
   153	static void blk_mq_map_hk_queues(struct blk_mq_queue_map *qmap)
   154	{
 > 155		cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
   156		struct cpumask *hk_masks __free(kfree) = NULL;
   157		const struct cpumask *mask;
   158		unsigned int queue, cpu, nr_masks;
   159	
   160		if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
   161			mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
   162		else
   163			goto fallback;
   164	
   165		if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
   166			goto fallback;
   167	
   168		/* Map housekeeping CPUs to a hctx */
   169		hk_masks = group_mask_cpus_evenly(qmap->nr_queues, mask, &nr_masks);
   170		if (!hk_masks)
   171			goto fallback;
   172	
   173		for (queue = 0; queue < qmap->nr_queues; queue++) {
   174			unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
   175	
   176			for_each_cpu(cpu, &hk_masks[idx]) {
   177				qmap->mq_map[cpu] = idx;
   178	
   179				if (cpu_online(cpu))
   180					cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
   181			}
   182		}
   183	
   184		/* Map isolcpus to hardware context */
   185		queue = cpumask_first(active_hctx);
   186		for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
   187			qmap->mq_map[cpu] = (qmap->queue_offset + queue) % nr_masks;
   188			queue = cpumask_next_wrap(queue, active_hctx);
   189		}
   190	
   191		if (!blk_mq_hk_validate(qmap, active_hctx))
   192			goto fallback;
   193	
   194		return;
   195	
   196	fallback:
   197		/*
   198		 * Map all CPUs to the first hctx to ensure at least one online
   199		 * housekeeping CPU is serving it.
   200		 */
   201		for_each_possible_cpu(cpu)
   202			qmap->mq_map[cpu] = 0;
   203	}
   204	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-07-03  9:01   ` Christoph Hellwig
@ 2025-07-04  9:00     ` Daniel Wagner
  2025-07-07  5:42       ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-04  9:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Sagi Grimberg,
	Michael S. Tsirkin, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Thu, Jul 03, 2025 at 11:01:58AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 02, 2025 at 06:33:58PM +0200, Daniel Wagner wrote:
> >  const struct cpumask *blk_mq_possible_queue_affinity(void)
> >  {
> > +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
> > +		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> > +
> >  	return cpu_possible_mask;
> >  }
> 
> I'm no expert on the housekeeping stuff, but why isn't the
> housekeeping_enabled check done in housekeeping_cpumask directly so
> that the drivers could use housekeeping_cpumask without a blk-mq
> wrapper?

Yes, housekeeping_cpumask will return cpu_possible_mask when housekeeping
is disabled. Though some drivers want cpu_online_mask instead. If all
drivers agreed on one version of the mask, it should be possible to drop
these helpers (maybe the housekeeping API needs to be extended then,
though).
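
Something like a purely hypothetical helper (name made up, not an
existing API) could cover the online case without a blk-mq wrapper:

	/* illustrative sketch only */
	static inline const struct cpumask *
	housekeeping_online_cpumask(enum hk_type type, struct cpumask *buf)
	{
		cpumask_and(buf, cpu_online_mask, housekeeping_cpumask(type));
		return buf;
	}

which is more or less what blk_mq_online_queue_affinity() ends up doing
in patch 8.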

This is also what Hannes brought up. If the number of supported hardware
queues for a device is less than the number of possible CPUs, it really
makes sense to distribute the hardware queues only among the online CPUs.
I think the only two drivers which are interested in cpu_possible_mask
are nvme-pci and virtio.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-07-03  6:58   ` Hannes Reinecke
@ 2025-07-04  9:21     ` Daniel Wagner
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-07-04  9:21 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Thu, Jul 03, 2025 at 08:58:02AM +0200, Hannes Reinecke wrote:
> > +	for (queue = 0; queue < qmap->nr_queues; queue++) {
> > +		unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
> > +
> > +		for_each_cpu(cpu, &hk_masks[idx]) {
> > +			qmap->mq_map[cpu] = idx;
> > +
> > +			if (cpu_online(cpu))
> > +				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
> 
> Why cpu_online? Up until this point it really didn't matter if the affinity
> mask was set to 'online' or 'possible' cpus, but here you
> require CPUs to be online...

This part tracks whether a hardware context has at least one
housekeeping CPU online. It is possible to provide a configuration where
we end up with hardware contexts which have only offline housekeeping CPUs
and online isolated CPUs. active_hctx tracks which of the hardware contexts
are usable, which is used in the next step...

> > +		}
> > +	}
> > +
> > +	/* Map isolcpus to hardware context */
> > +	queue = cpumask_first(active_hctx);
> > +	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
> > +		qmap->mq_map[cpu] = (qmap->queue_offset + queue) % nr_masks;
> > +		queue = cpumask_next_wrap(queue, active_hctx);
> > +	}
> 
> Really? Doesn't this map _all_ cpus, and not just the isolcpus?

for_each_cpu_andnot iterates over all CPUs which are not housekeeping CPUs
(mask is the housekeeping mask), thus these are all isolated CPUs. Note the
'andnot' part.

The cpumask_first/cpumask_next_wrap calls only return hardware contexts
which have at least one online housekeeping CPU. Yes, it is possible to
make this a bit smarter, so that we keep the grouping of the offline
CPUs intact, though I am not sure it is worth adding complexity for a
corner case, at least not yet.
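
For example, with the isolcpus=io_queue,2-3,6-7,12-13 setup from the
commit message, mask covers the housekeeping CPUs 0-1,4-5,8-11,14-15,
so the andnot loop only visits the isolated CPUs 2,3,6,7,12 and 13 and
spreads them across the hctxs recorded in active_hctx.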

> > +fallback:
> > +	/*
> > +	 * Map all CPUs to the first hctx to ensure at least one online
> > +	 * housekeeping CPU is serving it.
> > +	 */
> > +	for_each_possible_cpu(cpu)
> > +		qmap->mq_map[cpu] = 0;
> 
> I think you need to map all hctx, no?

The block layer filters out hctxs which have no CPU assigned to them
when selecting a queue. This is really a failsafe mode; it just makes
sure the system boots.

> > +	/* Map housekeeping CPUs to a hctx */
> > +	for (queue = 0; queue < qmap->nr_queues; queue++) {
> > +		for_each_cpu(cpu, dev->bus->irq_get_affinity(dev, offset + queue)) {
> > +			qmap->mq_map[cpu] = qmap->queue_offset + queue;
> > +
> > +			cpumask_set_cpu(cpu, mask);
> > +			if (cpu_online(cpu))
> > +				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
> 
> Now that is really curious. You pick up the interrupt affinity from the
> 'bus', which, I assume, is the PCI bus. And this would imply that the
> bus can be (or already is) programmed for this interrupt affinity.

Yes, this is the case. irq_create_affinity_masks uses
group_cpus_evenly/group_mask_cpus_evenly for the number of requested IRQs.
The number of IRQs can be higher than the number of requested queues
here. It's necessary to use the affinity masks created by
irq_create_affinity_masks as input.

> Which would imply that this is a usable interrupt affinity from the
> hardware perspective, irrespective of whether the cpu is online or
> not. So why the check to cpu_online()? Can't we simply take the existing affinity
> and rely on the hardware to do the right thing?

Again, this is tracking whether a hctx has an online housekeeping CPU.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/10] scsi: Use block layer helpers to constrain queue affinity
  2025-07-03  6:43   ` Hannes Reinecke
@ 2025-07-04  9:37     ` Daniel Wagner
  2025-07-04 10:28       ` Hannes Reinecke
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2025-07-04  9:37 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Thu, Jul 03, 2025 at 08:43:01AM +0200, Hannes Reinecke wrote:
> All of these drivers are not aware of CPU hotplug, and as such
> will not be notified when the number of CPUs changes.

Ah, this explains this part.

> But you use 'blk_mq_online_queue_affinity()' for all of these
> drivers.

All these drivers are also using blk_mq_num_online_queues. When I only
used cpu_possible_mask, the resulting mapping was not usable.

> Wouldn't 'blk_mq_possible_queue_affinity()' be a better choice here
> to insulate against CPU hotplug effects?

With this mask the queues will be distributed across all possible CPUs,
and some of the hardware queues could be assigned to offline CPUs. I think
this would work, but the question is: is it okay to leave some of the
performance on the table?
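
E.g. with 8 hwqs on a system that has 16 possible but only 8 online
CPUs, spreading over the possible mask pairs each hwq with two possible
CPUs; depending on which CPUs happen to be offline, the 8 online CPUs
may end up sharing only a subset of the hwqs while the rest sit idle.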

I am not against this, just saying it would change the existing
behavior.

> Also some drivers which are using irq affinity (eg aacraid, lpfc) are
> missing from these conversions. Why?

I was not aware of aacraid. I started to work on lpfc and well let's put
it this way, it's complicated. lpfc needs a lot of work to make it
isolcpus aware.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/10] scsi: Use block layer helpers to constrain queue affinity
  2025-07-04  9:37     ` Daniel Wagner
@ 2025-07-04 10:28       ` Hannes Reinecke
  2025-07-04 12:30         ` Daniel Wagner
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2025-07-04 10:28 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On 7/4/25 11:37, Daniel Wagner wrote:
> On Thu, Jul 03, 2025 at 08:43:01AM +0200, Hannes Reinecke wrote:
>> All of these drivers are not aware of CPU hotplug, and as such
>> will not be notified when the number of CPUs changes.
> 
> Ah, this explains this part.
> 
>> But you use 'blk_mq_online_queue_affinity()' for all of these
>> drivers.
> 
> All these drivers are also using blk_mq_num_online_queues. When I only
> used cpu_possible_mask, the resulting mapping was not usable.
> 
Yeah, I'd imagine so. Quite a few drivers have 'interesting' ideas about
how the firmware interface should look.

But it also means that there is a very high likelihood that these
drivers become inoperable under CPU hotplug.
Is there a way of disabling CPU hotplug when these drivers are in use?

>> Wouldn't 'blk_mq_possible_queue_affinity()' be a better choice here
>> to insulate against CPU hotplug effects?
> 
> With this mask the queues will be distributed across all possible CPUs,
> and some of the hardware queues could be assigned to offline CPUs. I think
> this would work, but the question is: is it okay to leave some of the
> performance on the table?
> 
It really shouldn't be an issue when the cpus are distributed 
'correctly' :-)
We have several possibilities:
-> #hwq > num_possible_cpus: easy, 1:1 mapping, no problem
-> num_online_cpu < #hwq < num_possible_cpus: Not as easy, but if we
    ensure that each online cpu is mapped to a different hwq we don't
    have a performance impact.
-> #hwq < num_online_cpu: If we ensure that a) the number of online cpus
    per hwq is (roughly) identical we won't have a performance impact.
    As a bonus we should strive to have the number of offline cpus
    distributed equally on each hwq.

Of course, that doesn't take into account NUMA locality; with NUMA
locality you would need to ensure to have at least one CPU per NUMA node
mapped to each hwq. Which actually would impose a lower limit on the
number (and granularity!) of hwqs (namely the number of NUMA nodes), but 
that's fair, I guess.

But this really can be delegated to later patches; initially we really
should identify which drivers might have issues with CPU hotplug,
and at the very least issue a warning for these drivers.

> I am not against this, just saying it would change the existing
> behavior.
> 

Oh, sure. No-one (except lpfc on Power) is testing CPU hotplug actively.

>> Also some drivers which are using irq affinity (eg aacraid, lpfc) are
>> missing from these conversions. Why?
> 
> I was not aware of aacraid. I started to work on lpfc and well let's put
> it this way, it's complicated. lpfc needs a lot of work to make it
> isolcpus aware.

Yeah, I know. Sorry ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/10] scsi: Use block layer helpers to constrain queue affinity
  2025-07-04 10:28       ` Hannes Reinecke
@ 2025-07-04 12:30         ` Daniel Wagner
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-07-04 12:30 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Fri, Jul 04, 2025 at 12:28:49PM +0200, Hannes Reinecke wrote:
> It really shouldn't be an issue when the cpus are distributed 'correctly'
> :-)

If I get the drift, we're starting to discuss how the mapping should be
done in general, not just for isolcpus. The isolcpus case only changes how
many hwqs are available (and their affinity):

 num queues to map = min(num housekeeping CPUs, #hwq)

and then it's common code, no special housekeeping mapping code.
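
Roughly what blk_mq_num_queues() from patch 8 already does; a sketch
only (helper name made up, assuming HK_TYPE_IO_QUEUE is enabled):

	static unsigned int hk_nr_queues_to_map(unsigned int max_queues)
	{
		struct cpumask avail;

		/* only housekeeping CPUs count towards the queue count */
		cpumask_and(&avail, cpu_possible_mask,
			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
		return min_not_zero(cpumask_weight(&avail), max_queues);
	}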

> We have several possibilities:
> -> #hwq > num_possible_cpus: easy, 1:1 mapping, no problem

I agree, no problem here.

> -> num_online_cpu < #hwq < num_possible_cpus: Not as easy, but if we
>    ensure that each online cpu is mapped to a different hwq we don't
>    have a performance impact.

This should be fairly straightforward too. First assign each online
CPU a hwq and distribute the remaining hwqs among the rest of the
possible but offline CPUs.
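
Something like this, maybe (untested sketch, reusing the qmap naming
from the existing mapping code):

	unsigned int cpu, queue = 0;

	/* give every online CPU its own hwq first */
	for_each_online_cpu(cpu) {
		qmap->mq_map[cpu] = qmap->queue_offset + (queue % qmap->nr_queues);
		queue++;
	}
	/* then spread the remaining hwqs over the offline possible CPUs */
	for_each_cpu_andnot(cpu, cpu_possible_mask, cpu_online_mask) {
		qmap->mq_map[cpu] = qmap->queue_offset + (queue % qmap->nr_queues);
		queue++;
	}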

> -> #hwq < num_online_cpu: If we ensure that a) the number of online cpus
>    per hwq is (roughly) identical we won't have a performance impact.
>    As a bonus we should strive to have the number of offline cpus
>    distributed equally on each hwq.

__group_cpus_evenly is handling this pretty well.

> Of course, that doesn't take into account NUMA locality; with NUMA locality
> you would need to ensure to have at least one CPU per NUMA node
> mapped to each hwq. Which actually would impose a lower limit on the
> number (and granularity!) of hwqs (namely the number of NUMA nodes), but
> that's fair, I guess.

Again __group_cpus_evenly is taking NUMA into account as I understand it.

> But this really can be delegated to later patches; initially we really
> should identify which drivers might have issues with CPU hotplug,
> and at the very least issue a warning for these drivers.

There are different ways, I suppose. My approach is not to change the
drivers too much, because I don't have access to all the hardware for
testing, but instead to extend the core code so that the different cases
are covered.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-07-04  9:00     ` Daniel Wagner
@ 2025-07-07  5:42       ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2025-07-07  5:42 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Daniel Wagner, Jens Axboe, Keith Busch,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, Mathieu Desnoyers, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, GR-QLogic-Storage-Upstream

On Fri, Jul 04, 2025 at 11:00:56AM +0200, Daniel Wagner wrote:
> > I'm no expert on the housekeeping stuff, but why isn't the
> > housekeeping_enabled check done in housekeeping_cpumask directly so
> > that the drivers could use housekeeping_cpumask without a blk-mq
> > wrapper?
> 
> Yes, housekeeping_cpumask will return cpu_possible_mask when housekeeping
> is disabled. Though some drivers want cpu_online_mask instead. If all
> drivers agreed on one version of the mask, it should be possible to drop
> these helpers (maybe the housekeeping API needs to be extended then,
> though).

Drivers don't get cpu hotplug notifications, so cpu_possible_mask is
the only valid answer right now.  That could change if we ever implement
notifications to the drivers.

> This is also what Hannes brought up. If the number of supported hardware
> queues for a device is less than the number of possible CPUs, it really
> makes sense to distribute the hardware queues only among the online CPUs.
> I think the only two drivers which are interested in cpu_possible_mask
> are nvme-pci and virtio.

Those are the only two drivers that get it right :(

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2025-07-02 16:33 ` [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
  2025-07-03  6:58   ` Hannes Reinecke
@ 2025-07-07  7:44   ` Ming Lei
  2025-07-08  1:31     ` Aaron Tomlin
  2025-07-08  1:23   ` Aaron Tomlin
  2 siblings, 1 reply; 36+ messages in thread
From: Ming Lei @ 2025-07-07  7:44 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On Wed, Jul 02, 2025 at 06:33:59PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled, and the last housekeeping CPU for a
> given hctx goes offline, there would be no CPU left to handle I/O. To
> prevent I/O stalls, prevent offlining housekeeping CPUs that are still
> serving isolated CPUs.

If you do so, the CPU offline operation fails, and this change is user
visible. Please document the CPU offline failure limitation in
Documentation/admin-guide/kernel-parameters.txt in the next patch.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2025-07-02 16:33 ` [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
  2025-07-03  6:58   ` Hannes Reinecke
  2025-07-07  7:44   ` Ming Lei
@ 2025-07-08  1:23   ` Aaron Tomlin
  2 siblings, 0 replies; 36+ messages in thread
From: Aaron Tomlin @ 2025-07-08  1:23 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On Wed, Jul 02, 2025 at 06:33:59PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled, and the last housekeeping CPU for a
> given hctx goes offline, there would be no CPU left to handle I/O. To
> prevent I/O stalls, prevent offlining housekeeping CPUs that are still
> serving isolated CPUs.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>  block/blk-mq.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0c61492724d228736f975f1d8f195515603801b6..87240644f73ed0490a5459e042c68e0e168f727d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3681,6 +3681,43 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>  	return data.has_rq;
>  }
>  
> +static bool blk_mq_hctx_can_offline_hk_cpu(struct blk_mq_hw_ctx *hctx,
> +					   unsigned int this_cpu)
> +{
> +	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +
> +	for (int i = 0; i < hctx->nr_ctx; i++) {
> +		struct blk_mq_ctx *ctx = hctx->ctxs[i];
> +
> +		if (ctx->cpu == this_cpu)
> +			continue;
> +
> +		/*
> +		 * Check if this context has at least one online
> +		 * housekeeping CPU; in this case the hardware context is
> +		 * usable.
> +		 */
> +		if (cpumask_test_cpu(ctx->cpu, hk_mask) &&
> +		    cpu_online(ctx->cpu))
> +			break;
> +
> +		/*
> +		 * The context doesn't have any online housekeeping CPUs,
> +		 * but there might be an online isolated CPU mapped to
> +		 * it.
> +		 */
> +		if (cpu_is_offline(ctx->cpu))
> +			continue;
> +
> +		pr_warn("%s: trying to offline hctx%d but there is still an online isolcpu CPU %d mapped to it\n",
> +			hctx->queue->disk->disk_name,
> +			hctx->queue_num, ctx->cpu);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
>  static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
>  		unsigned int this_cpu)
>  {
> @@ -3712,6 +3749,11 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>  	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
>  			struct blk_mq_hw_ctx, cpuhp_online);
>  
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		if (!blk_mq_hctx_can_offline_hk_cpu(hctx, cpu))
> +			return -EINVAL;
> +	}
> +
>  	if (blk_mq_hctx_has_online_cpu(hctx, cpu))
>  		return 0;
>  
> 
> -- 
> 2.50.0
> 
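
As a quick sanity check of the user-visible behaviour (disk name and CPU
numbers below are purely illustrative; assume cpu1 is the last online
housekeeping CPU for hctx2 and an isolated CPU mapped to it is still
online), the offline attempt should now be rejected roughly like this:

  # echo 0 > /sys/devices/system/cpu/cpu1/online
  -bash: echo: write error: Invalid argument
  # dmesg | tail -1
  nvme0n1: trying to offline hctx2 but there is still an online isolcpu CPU 5 mapped to it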

Thanks Daniel and great work!

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>

-- 
Aaron Tomlin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 07/10] isolation: Introduce io_queue isolcpus type
  2025-07-02 16:33 ` [PATCH v7 07/10] isolation: Introduce io_queue isolcpus type Daniel Wagner
@ 2025-07-08  1:26   ` Aaron Tomlin
  0 siblings, 0 replies; 36+ messages in thread
From: Aaron Tomlin @ 2025-07-08  1:26 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On Wed, Jul 02, 2025 at 06:33:57PM +0200, Daniel Wagner wrote:
> Multiqueue drivers spread I/O queues across all CPUs for optimal
> performance. However, these drivers are not aware of CPU isolation
> requirements and will distribute queues without considering the isolcpus
> configuration.
> 
> Introduce a new isolcpus mask that allows users to define which CPUs
> should have I/O queues assigned. This is similar to managed_irq, but
> intended for drivers that do not use the managed IRQ infrastructure.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>  include/linux/sched/isolation.h | 1 +
>  kernel/sched/isolation.c        | 7 +++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index d8501f4709b583b8a1c91574446382f093bccdb1..6b6ae9c5b2f61a93c649a98ea27482b932627fca 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -9,6 +9,7 @@
>  enum hk_type {
>  	HK_TYPE_DOMAIN,
>  	HK_TYPE_MANAGED_IRQ,
> +	HK_TYPE_IO_QUEUE,
>  	HK_TYPE_KERNEL_NOISE,
>  	HK_TYPE_MAX,
>  
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 93b038d48900a304a29ecc0c8aa8b7d419ea1397..c8cb0cf2b15a11524be73826f38bb2a0709c449c 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -11,6 +11,7 @@
>  enum hk_flags {
>  	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
>  	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
> +	HK_FLAG_IO_QUEUE	= BIT(HK_TYPE_IO_QUEUE),
>  	HK_FLAG_KERNEL_NOISE	= BIT(HK_TYPE_KERNEL_NOISE),
>  };
>  
> @@ -224,6 +225,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
>  			continue;
>  		}
>  
> +		if (!strncmp(str, "io_queue,", 9)) {
> +			str += 9;
> +			flags |= HK_FLAG_IO_QUEUE;
> +			continue;
> +		}
> +
>  		/*
>  		 * Skip unknown sub-parameter and validate that it is not
>  		 * containing an invalid character.
> 
> -- 
> 2.50.0
> 
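
For reference, a minimal command line enabling this (the CPU list is only
an example) looks like:

  isolcpus=io_queue,2-7

i.e. CPUs 2-7 are isolated and I/O queue placement is restricted to the
remaining housekeeping CPUs 0-1 on an 8-CPU machine.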

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>

-- 
Aaron Tomlin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 10/10] docs: add io_queue flag to isolcpus
  2025-07-02 16:34 ` [PATCH v7 10/10] docs: add io_queue flag to isolcpus Daniel Wagner
  2025-07-03  6:59   ` Hannes Reinecke
@ 2025-07-08  1:26   ` Aaron Tomlin
  1 sibling, 0 replies; 36+ messages in thread
From: Aaron Tomlin @ 2025-07-08  1:26 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On Wed, Jul 02, 2025 at 06:34:00PM +0200, Daniel Wagner wrote:
> The io_queue flag informs multiqueue device drivers where to place
> hardware queues. Document this new flag in the isolcpus
> command-line argument description.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f1f2c0874da9ddfc95058c464fdf5dabaf0de713..7594ac5448575cc895ebf7af0fe051d42dc5e0e9 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2590,7 +2590,6 @@
>  			  "number of CPUs in system - 1".
>  
>  			managed_irq
> -
>  			  Isolate from being targeted by managed interrupts
>  			  which have an interrupt mask containing isolated
>  			  CPUs. The affinity of managed interrupts is
> @@ -2613,6 +2612,24 @@
>  			  housekeeping CPUs has no influence on those
>  			  queues.
>  
> +			io_queue
> +			  Isolate from I/O queue work caused by multiqueue
> +			  device drivers. Restrict the placement of
> +			  queues to housekeeping CPUs only, ensuring that
> +			  all I/O work is processed by a housekeeping CPU.
> +
> +			  Housekeeping CPUs that serve isolated CPUs
> +			  cannot be offlined.
> +
> +			  The io_queue configuration takes precedence over
> +			  managed_irq; thus, when io_queue is used,
> +			  managed_irq has no effect.
> +
> +			  Note: When an isolated CPU issues an I/O request,
> +			  it is forwarded to a housekeeping CPU. This will
> +			  trigger a software interrupt on the completion
> +			  path.
> +
>  			The format of <cpu-list> is described above.
>  
>  	iucv=		[HW,NET]
> 
> -- 
> 2.50.0
> 
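
A combined example based on the documented precedence (CPU list again
illustrative):

  isolcpus=io_queue,managed_irq,2-7

Here io_queue takes precedence, so managed_irq adds nothing further for
the isolated CPUs 2-7.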

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>

-- 
Aaron Tomlin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2025-07-07  7:44   ` Ming Lei
@ 2025-07-08  1:31     ` Aaron Tomlin
  0 siblings, 0 replies; 36+ messages in thread
From: Aaron Tomlin @ 2025-07-08  1:31 UTC (permalink / raw)
  To: Daniel Wagner, Ming Lei
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On Mon, Jul 07, 2025 at 03:44:30PM +0800, Ming Lei wrote:
> On Wed, Jul 02, 2025 at 06:33:59PM +0200, Daniel Wagner wrote:
> > When isolcpus=io_queue is enabled, and the last housekeeping CPU for a
> > given hctx goes offline, there would be no CPU left to handle I/O. To
> > prevent I/O stalls, prevent offlining housekeeping CPUs that are still
> > serving isolated CPUs.
> 
> If you do so, CPU offline will fail, and this change is user visible. Please
> document this CPU offline limitation in
> Documentation/admin-guide/kernel-parameters.txt in the next patch.
> 
> Thanks,
> Ming

+1

-- 
Aaron Tomlin

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly()
  2025-07-02 16:33 ` [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly() Daniel Wagner
  2025-07-03  6:18   ` Hannes Reinecke
@ 2025-07-11  8:28   ` John Garry
  2025-09-03 12:42     ` Daniel Wagner
  1 sibling, 1 reply; 36+ messages in thread
From: John Garry @ 2025-07-11  8:28 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-block, linux-nvme,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 02/07/2025 17:33, Daniel Wagner wrote:

/s/group_masks_cpus_evenly/group_mask_cpus_evenly/

> group_mask_cpus_evenly() allows the caller to pass in a CPU mask that
> should be evenly distributed. This new function is a more generic
> version of the existing group_cpus_evenly(), which always distributes
> all present CPUs into groups.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   include/linux/group_cpus.h |  3 +++
>   lib/group_cpus.c           | 64 +++++++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 66 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
> index 9d4e5ab6c314b31c09fda82c3f6ac18f77e9de36..d4604dce1316a08400e982039006331f34c18ee8 100644
> --- a/include/linux/group_cpus.h
> +++ b/include/linux/group_cpus.h
> @@ -10,5 +10,8 @@
>   #include <linux/cpu.h>
>   
>   struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks);
> +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
> +				       const struct cpumask *cpu_mask,
> +				       unsigned int *nummasks);
>   
>   #endif
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index 6d08ac05f371bf880571507d935d9eb501616a84..00c9b7a10c8acd29239fe20d2a30fdae22ef74a5 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -8,6 +8,7 @@
>   #include <linux/cpu.h>
>   #include <linux/sort.h>
>   #include <linux/group_cpus.h>
> +#include <linux/sched/isolation.h>
>   
>   #ifdef CONFIG_SMP
>   
> @@ -425,6 +426,59 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
>   	*nummasks = min(nr_present + nr_others, numgrps);
>   	return masks;
>   }
> +EXPORT_SYMBOL_GPL(group_cpus_evenly);
> +
> +/**
> + * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
> + * @numgrps: number of groups

this comment could be a bit more useful

> + * @cpu_mask: CPU to consider for the grouping

this is a CPU mask, and not a specific CPU index, right?

> + * @nummasks: number of initialized cpusmasks
> + *
> + * Return: cpumask array if successful, NULL otherwise. And each element
> + * includes CPUs assigned to this group.
> + *
> + * Try to put close CPUs from viewpoint of CPU and NUMA locality into
> + * same group. Allocate present CPUs on these groups evenly.
> + */
> +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
> +				       const struct cpumask *cpu_mask,
> +				       unsigned int *nummasks)
> +{
> +	cpumask_var_t *node_to_cpumask;
> +	cpumask_var_t nmsk;
> +	int ret = -ENOMEM;
> +	struct cpumask *masks = NULL;
> +
> +	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
> +		return NULL;
> +
> +	node_to_cpumask = alloc_node_to_cpumask();
> +	if (!node_to_cpumask)
> +		goto fail_nmsk;
> +
> +	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
> +	if (!masks)
> +		goto fail_node_to_cpumask;
> +
> +	build_node_to_cpumask(node_to_cpumask);
> +
> +	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, cpu_mask, nmsk,
> +				  masks);

maybe it's personal taste, but I don't think that it's a good style to 
always pass through 'fail' labels, even if we have not failed in some step

> +
> +fail_node_to_cpumask:
> +	free_node_to_cpumask(node_to_cpumask);
> +
> +fail_nmsk:
> +	free_cpumask_var(nmsk);
> +	if (ret < 0) {
> +		kfree(masks);
> +		return NULL;
> +	}
> +	*nummasks = ret;
> +	return masks;
> +}
> +EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
> +
>   #else /* CONFIG_SMP */
>   struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
>   {
> @@ -442,5 +496,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
>   	*nummasks = 1;
>   	return masks;
>   }
> -#endif /* CONFIG_SMP */
>   EXPORT_SYMBOL_GPL(group_cpus_evenly);
> +
> +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
> +				       const struct cpumask *cpu_mask,
> +				       unsigned int *nummasks)
> +{
> +	return group_cpus_evenly(numgrps, nummasks);
> +}
> +EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
> +#endif /* CONFIG_SMP */
> 
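
To check my understanding of how this is meant to be consumed, here is a
minimal caller sketch (not from the series; HK_TYPE_IO_QUEUE comes from
patch 07, and the caller owns, and must kfree(), the returned array):

#include <linux/group_cpus.h>
#include <linux/sched/isolation.h>

static struct cpumask *spread_queues_over_hk_cpus(unsigned int numgrps,
						  unsigned int *nummasks)
{
	const struct cpumask *hk_mask;

	/* CPUs outside this mask never get a queue assigned. */
	hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);

	return group_mask_cpus_evenly(numgrps, hk_mask, nummasks);
}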


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly()
  2025-07-03  6:18   ` Hannes Reinecke
@ 2025-09-03 12:36     ` Daniel Wagner
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-09-03 12:36 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Thu, Jul 03, 2025 at 08:18:50AM +0200, Hannes Reinecke wrote:
> > +/**
> > + * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
> > + * @numgrps: number of groups
> > + * @cpu_mask: CPU to consider for the grouping
> > + * @nummasks: number of initialized cpusmasks
> > + *
> > + * Return: cpumask array if successful, NULL otherwise. And each element
> > + * includes CPUs assigned to this group.
> > + *
> > + * Try to put close CPUs from viewpoint of CPU and NUMA locality into
> > + * same group. Allocate present CPUs on these groups evenly.
> > + */
> 
> Description could be improved. Point is that you do not do any
> calculation here, you just call __group_cpus_evenly() with
> a different mask.

I updated the documentation; it matches group_cpus_evenly but with the
constraining mask applied.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly()
  2025-07-11  8:28   ` John Garry
@ 2025-09-03 12:42     ` Daniel Wagner
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Wagner @ 2025-09-03 12:42 UTC (permalink / raw)
  To: John Garry
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, Mathieu Desnoyers, linux-kernel,
	linux-block, linux-nvme, megaraidlinux.pdl, linux-scsi,
	storagedev, virtualization, GR-QLogic-Storage-Upstream

On Fri, Jul 11, 2025 at 09:28:12AM +0100, John Garry wrote:
> > +/**
> > + * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
> > + * @numgrps: number of groups
> 
> this comment could be a bit more useful
> 
> > + * @cpu_mask: CPU to consider for the grouping
> 
> this is a CPU mask, and not a specific CPU index, right?

Yes, I've updated the documentation to:

/**
 * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
 * @numgrps: number of cpumasks to create
 * @mask: CPUs to consider for the grouping
 * @nummasks: number of initialized cpumasks
 *
 * Return: cpumask array if successful, NULL otherwise. Only the CPUs
 * marked in @mask are considered for the grouping, and each element
 * includes the CPUs assigned to this group. @nummasks contains the
 * number of initialized masks, which can be less than @numgrps.
 *
 * Try to put close CPUs from viewpoint of CPU and NUMA locality into
 * same group, and run two-stage grouping:
 *	1) allocate present CPUs on these groups evenly first
 *	2) allocate other possible CPUs on these groups evenly
 *
 * We guarantee in the resulting grouping that all CPUs in @mask are
 * covered, and no CPU is assigned to multiple groups.
 */
struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
				       const struct cpumask *mask,
				       unsigned int *nummasks)

> > +	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, cpu_mask, nmsk,
> > +				  masks);
> 
> maybe it's personal taste, but I don't think that it's a good style to
> always pass through 'fail' labels, even if we have not failed in some
> step

I'd rather leave it as it is, because it matches the existing code in
group_cpus_evenly. And there is also alloc_node_to_cpumask which does
the same. Consistency wins IMO.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-09-03 12:42 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-02 16:33 [PATCH v7 00/10] blk: honor isolcpus configuration Daniel Wagner
2025-07-02 16:33 ` [PATCH v7 01/10] lib/group_cpus: Add group_masks_cpus_evenly() Daniel Wagner
2025-07-03  6:18   ` Hannes Reinecke
2025-09-03 12:36     ` Daniel Wagner
2025-07-11  8:28   ` John Garry
2025-09-03 12:42     ` Daniel Wagner
2025-07-02 16:33 ` [PATCH v7 02/10] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
2025-07-03  6:19   ` Hannes Reinecke
2025-07-02 16:33 ` [PATCH v7 03/10] blk-mq: add blk_mq_{online|possible}_queue_affinity Daniel Wagner
2025-07-03  6:29   ` Hannes Reinecke
2025-07-02 16:33 ` [PATCH v7 04/10] nvme-pci: use block layer helpers to constrain queue affinity Daniel Wagner
2025-07-03  6:29   ` Hannes Reinecke
2025-07-02 16:33 ` [PATCH v7 05/10] scsi: Use " Daniel Wagner
2025-07-03  6:43   ` Hannes Reinecke
2025-07-04  9:37     ` Daniel Wagner
2025-07-04 10:28       ` Hannes Reinecke
2025-07-04 12:30         ` Daniel Wagner
2025-07-02 16:33 ` [PATCH v7 06/10] virtio: blk/scsi: use " Daniel Wagner
2025-07-03  6:43   ` Hannes Reinecke
2025-07-02 16:33 ` [PATCH v7 07/10] isolation: Introduce io_queue isolcpus type Daniel Wagner
2025-07-08  1:26   ` Aaron Tomlin
2025-07-02 16:33 ` [PATCH v7 08/10] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
2025-07-03  6:58   ` Hannes Reinecke
2025-07-04  9:21     ` Daniel Wagner
2025-07-03  9:01   ` Christoph Hellwig
2025-07-04  9:00     ` Daniel Wagner
2025-07-07  5:42       ` Christoph Hellwig
2025-07-03 14:47   ` kernel test robot
2025-07-02 16:33 ` [PATCH v7 09/10] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
2025-07-03  6:58   ` Hannes Reinecke
2025-07-07  7:44   ` Ming Lei
2025-07-08  1:31     ` Aaron Tomlin
2025-07-08  1:23   ` Aaron Tomlin
2025-07-02 16:34 ` [PATCH v7 10/10] docs: add io_queue flag to isolcpus Daniel Wagner
2025-07-03  6:59   ` Hannes Reinecke
2025-07-08  1:26   ` Aaron Tomlin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).