[PATCH v8 00/12] blk: honor isolcpus configuration

public inbox for virtualization@lists.linux-foundation.org
 help / color / mirror / Atom feed

* [PATCH v8 00/12] blk: honor isolcpus configuration
@ 2025-09-05 14:59 Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 01/12] scsi: aacraid: use block layer helpers to calculate num of queues Daniel Wagner
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

The main changes in this version are

  - merged the mapping algorithm into the existing code
  - dropping a bunch of SCSI drivers update

With the merging of the isolcpus-aware mapping code, there is a change in
how the resulting CPU–hctx mapping looks for systems with identical CPUs
(non-hyperthreaded CPUs). My understanding is that it shouldn't matter,
but the devil is in the details.

  Package L#0
    NUMANode L#0 (P#0 3255MB)
    L3 L#0 (16MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)

base version:
queue mapping for /dev/nvme0n1
        hctx0: default 0 8
        hctx1: default 1 9
        hctx2: default 2 10
        hctx3: default 3 11
        hctx4: default 4 12
        hctx5: default 5 13
        hctx6: default 6 14
        hctx7: default 7 15

patched:
queue mapping for /dev/nvme0n1
        hctx0: default 0 1
        hctx1: default 2 3
        hctx2: default 4 5
        hctx3: default 6 7
        hctx4: default 8 9
        hctx5: default 10 11
        hctx6: default 12 13
        hctx7: default 14 15

  Package L#0 + L3 L#0 (16MB)
    L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
    L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)
  Package L#1 + L3 L#1 (16MB)
    L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4
      PU L#8 (P#8)
      PU L#9 (P#9)
    L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5
      PU L#10 (P#10)
      PU L#11 (P#11)
    L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6
      PU L#12 (P#12)
      PU L#13 (P#13)
    L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7
      PU L#14 (P#14)
      PU L#15 (P#15)

base and patched:
queue mapping for /dev/nvme0n1
        hctx0: default 0 1
        hctx1: default 2 3
        hctx2: default 4 5
        hctx3: default 6 7
        hctx4: default 8 9
        hctx5: default 10 11
        hctx6: default 12 13
        hctx7: default 14 15

As mentioned I've decided to update only SCSI drivers which are already
using pci_alloc_irq_vectors_affinity with the PCI_IRQ_AFFINITY. These
drivers are using the auto IRQ affinity managment code, which is what is
the pre-condition for isolcpus to work.

Also missing are the FC drivers which support nvme-fabrics (lpfc,
qla2xxx). The nvme-fabrics code needs to be touched first. I've got the
patches for this, but let's first get the main change in shape.

After that, I can start updating one driver one by one. I think this
reduced the risk of regression significantly.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
Changes in v8:
- added 524f5eea4bbe ("lib/group_cpus: remove !SMP code")
- merged new logic into existing function, avoid special casing
- group_mask_cpus_evenly:
  - /s/group_masks_cpus_evenly/group_mask_cpus_evenly
  - updated comment on group_mask_cpus_evenly
  - renamed argument from cpu_mask to mask
- aacraid: added missing num queue calculcation (new patch)
- only update scsi drivers which support PCI_IRQ_AFFINIT,
  and do not support nvme-fabrics
- don't __free for cpumask_var_t, it seems incompatible
- updated doc to hightlight the CPU offlining limitation
- collected tags
- Link to v7: https://patch.msgid.link/20250702-isolcpus-io-queues-v7-0-557aa7eacce4@kernel.org

Changes in v7:
- send out first part of the series:
  https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@kernel.org/
- added command line documentation
- added validation code, so that resulting mapping is operational
- rewrote mapping code for isolcpus so it takes into account active hctx
- added blk_mq_map_hk_irq_queues which uses mask from irq_get_affinity
- refactored blk_mq_map_hk_queues so caller tests for HK_TYPE_MANAGED_IRQ
- Link to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org

Changes in v6:
- added io_queue isolcpus type back
- prevent offlining hk cpu if a isol cpu is still present isntead just warning
- Link to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org

Changes in v5:
- rebased on latest for-6.14/block
- udpated documetation on managed_irq
- updated commit message "blk-mq: issue warning when offlining hctx with online isolcpus"
- split input/output parameter in "lib/group_cpus: let group_cpu_evenly return number of groups"
- dropped "sched/isolation: document HK_TYPE housekeeping option"
- Link to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org

Changes in v4:
- added "blk-mq: issue warning when offlining hctx with online isolcpus"
- fixed check in cgroup_cpus_evenly, the if condition needs to use
  housekeeping_enabled() and not cpusmask_weight(housekeeping_masks),
  because the later will always return a valid mask.
- dropped fixed tag from "lib/group_cpus.c: honor housekeeping config when
  grouping CPUs"
- fixed overlong line "scsi: use block layer helpers to calculate num
  of queues"
- dropped "sched/isolation: Add io_queue housekeeping option",
  just document the housekeep enum hk_type
- added "lib/group_cpus: let group_cpu_evenly return number of groups"
- collected tags
- splitted series into a preperation series:
  https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Link to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de

Changes in v3:
- lifted a couple of patches from
  https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/
  "virito: add APIs for retrieving vq affinity"
  "blk-mq: introduce blk_mq_dev_map_queues"
- replaces all users of blk_mq_[pci|virtio]_map_queues with
  blk_mq_dev_map_queues
- updated/extended number of queue calc helpers
- add isolcpus=io_queue CPU-hctx mapping function
- documented enum hk_type and isolcpus=io_queue
- added "scsi: pm8001: do not overwrite PCI queue mapping"
- Link to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes in v2:
- updated documentation
- splitted blk/nvme-pci patch
- dropped HK_TYPE_IO_QUEUE, use HK_TYPE_MANAGED_IRQ
- Link to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de

---
Daniel Wagner (12):
      scsi: aacraid: use block layer helpers to calculate num of queues
      lib/group_cpus: remove dead !SMP code
      lib/group_cpus: Add group_mask_cpus_evenly()
      genirq/affinity: Add cpumask to struct irq_affinity
      blk-mq: add blk_mq_{online|possible}_queue_affinity
      nvme-pci: use block layer helpers to constrain queue affinity
      scsi: Use block layer helpers to constrain queue affinity
      virtio: blk/scsi: use block layer helpers to constrain queue affinity
      isolation: Introduce io_queue isolcpus type
      blk-mq: use hk cpus only when isolcpus=io_queue is enabled
      blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
      docs: add io_queue flag to isolcpus

 Documentation/admin-guide/kernel-parameters.txt |  22 ++-
 block/blk-mq-cpumap.c                           | 201 +++++++++++++++++++++---
 block/blk-mq.c                                  |  42 +++++
 drivers/block/virtio_blk.c                      |   4 +-
 drivers/nvme/host/pci.c                         |   1 +
 drivers/scsi/aacraid/comminit.c                 |   3 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c          |   1 +
 drivers/scsi/megaraid/megaraid_sas_base.c       |   5 +-
 drivers/scsi/mpi3mr/mpi3mr_fw.c                 |   6 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c             |   5 +-
 drivers/scsi/pm8001/pm8001_init.c               |   1 +
 drivers/scsi/virtio_scsi.c                      |   5 +-
 include/linux/blk-mq.h                          |   2 +
 include/linux/group_cpus.h                      |   3 +
 include/linux/interrupt.h                       |  16 +-
 include/linux/sched/isolation.h                 |   1 +
 kernel/irq/affinity.c                           |  12 +-
 kernel/sched/isolation.c                        |   7 +
 lib/group_cpus.c                                |  63 ++++++--
 19 files changed, 353 insertions(+), 47 deletions(-)
---
base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b

Best regards,
-- 
Daniel Wagner <wagi@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 01/12] scsi: aacraid: use block layer helpers to calculate num of queues
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-08  6:06   ` Hannes Reinecke
  2025-09-05 14:59 ` [PATCH v8 02/12] lib/group_cpus: remove dead !SMP code Daniel Wagner
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

The calculation of the upper limit for queues does not depend solely on
the number of online CPUs; for example, the isolcpus kernel
command-line option must also be considered.

To account for this, the block layer provides a helper function to
retrieve the maximum number of queues. Use it to set an appropriate
upper queue number limit.

Fixes: 94970cfb5f10 ("scsi: use block layer helpers to calculate num of queues")
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/scsi/aacraid/comminit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
index 726c8531b7d3fbff4cc7b6a7ac4891f7bcb1c12f..788d7bf0a2d371fd3b38d88b0a9d76937f37d28b 100644
--- a/drivers/scsi/aacraid/comminit.c
+++ b/drivers/scsi/aacraid/comminit.c
@@ -469,8 +469,7 @@ void aac_define_int_mode(struct aac_dev *dev)
 	}
 
 	/* Don't bother allocating more MSI-X vectors than cpus */
-	msi_count = min(dev->max_msix,
-		(unsigned int)num_online_cpus());
+	msi_count = blk_mq_num_online_queues(dev->max_msix);
 
 	dev->max_msix = msi_count;
 

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 01/12] scsi: aacraid: use block layer helpers to calculate num of queues
  2025-09-05 14:59 ` [PATCH v8 01/12] scsi: aacraid: use block layer helpers to calculate num of queues Daniel Wagner
@ 2025-09-08  6:06   ` Hannes Reinecke
  0 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2025-09-08  6:06 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 9/5/25 16:59, Daniel Wagner wrote:
> The calculation of the upper limit for queues does not depend solely on
> the number of online CPUs; for example, the isolcpus kernel
> command-line option must also be considered.
> 
> To account for this, the block layer provides a helper function to
> retrieve the maximum number of queues. Use it to set an appropriate
> upper queue number limit.
> 
> Fixes: 94970cfb5f10 ("scsi: use block layer helpers to calculate num of queues")
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/scsi/aacraid/comminit.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 02/12] lib/group_cpus: remove dead !SMP code
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 01/12] scsi: aacraid: use block layer helpers to calculate num of queues Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-08  6:06   ` Hannes Reinecke
  2025-09-05 14:59 ` [PATCH v8 03/12] lib/group_cpus: Add group_mask_cpus_evenly() Daniel Wagner
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

The support for the !SMP configuration has been removed from the core by
commit cac5cefbade9 ("sched/smp: Make SMP unconditional").

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 lib/group_cpus.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index 6d08ac05f371bf880571507d935d9eb501616a84..f254b232522d44c141cdc4e44e2c99a4148c08d6 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -9,8 +9,6 @@
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
 
-#ifdef CONFIG_SMP
-
 static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
 {
@@ -425,22 +423,4 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	*nummasks = min(nr_present + nr_others, numgrps);
 	return masks;
 }
-#else /* CONFIG_SMP */
-struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
-{
-	struct cpumask *masks;
-
-	if (numgrps == 0)
-		return NULL;
-
-	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
-	if (!masks)
-		return NULL;
-
-	/* assign all CPUs(cpu 0) to the 1st group only */
-	cpumask_copy(&masks[0], cpu_possible_mask);
-	*nummasks = 1;
-	return masks;
-}
-#endif /* CONFIG_SMP */
 EXPORT_SYMBOL_GPL(group_cpus_evenly);

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 02/12] lib/group_cpus: remove dead !SMP code
  2025-09-05 14:59 ` [PATCH v8 02/12] lib/group_cpus: remove dead !SMP code Daniel Wagner
@ 2025-09-08  6:06   ` Hannes Reinecke
  0 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2025-09-08  6:06 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 9/5/25 16:59, Daniel Wagner wrote:
> The support for the !SMP configuration has been removed from the core by
> commit cac5cefbade9 ("sched/smp: Make SMP unconditional").
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   lib/group_cpus.c | 20 --------------------
>   1 file changed, 20 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 03/12] lib/group_cpus: Add group_mask_cpus_evenly()
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 01/12] scsi: aacraid: use block layer helpers to calculate num of queues Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 02/12] lib/group_cpus: remove dead !SMP code Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 04/12] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

group_mask_cpu_evenly() allows the caller to pass in a CPU mask that
should be evenly distributed. This new function is a more generic
version of the existing group_cpus_evenly(), which always distributes
all present CPUs into groups.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/group_cpus.h |  3 +++
 lib/group_cpus.c           | 59 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index 9d4e5ab6c314b31c09fda82c3f6ac18f77e9de36..defab4123a82fa37cb2a9920029be8e3e121ca0d 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -10,5 +10,8 @@
 #include <linux/cpu.h>
 
 struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks);
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *mask,
+				       unsigned int *nummasks);
 
 #endif
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index f254b232522d44c141cdc4e44e2c99a4148c08d6..ec0852132266618f540c580422f254684129ce90 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -8,6 +8,7 @@
 #include <linux/cpu.h>
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
@@ -424,3 +425,61 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	return masks;
 }
 EXPORT_SYMBOL_GPL(group_cpus_evenly);
+
+/**
+ * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of cpumasks to create
+ * @mask: CPUs to consider for the grouping
+ * @nummasks: number of initialized cpusmasks
+ *
+ * Return: cpumask array if successful, NULL otherwise. Only the CPUs
+ * marked in the mask will be considered for the grouping. And each
+ * element includes CPUs assigned to this group. nummasks contains the
+ * number of initialized masks which can be less than numgrps. cpu_mask
+ *
+ * Try to put close CPUs from viewpoint of CPU and NUMA locality into
+ * same group, and run two-stage grouping:
+ *	1) allocate present CPUs on these groups evenly first
+ *	2) allocate other possible CPUs on these groups evenly
+ *
+ * We guarantee in the resulted grouping that all CPUs are covered, and
+ * no same CPU is assigned to multiple groups
+ */
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *mask,
+				       unsigned int *nummasks)
+{
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_nmsk;
+
+	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	build_node_to_cpumask(node_to_cpumask);
+
+	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, mask, nmsk,
+				  masks);
+
+fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	*nummasks = ret;
+	return masks;
+}
+EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 04/12] genirq/affinity: Add cpumask to struct irq_affinity
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (2 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 03/12] lib/group_cpus: Add group_mask_cpus_evenly() Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-10  8:22   ` Thomas Gleixner
  2025-09-05 14:59 ` [PATCH v8 05/12] blk-mq: add blk_mq_{online|possible}_queue_affinity Daniel Wagner
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Pass a cpumask to irq_create_affinity_masks as an additional constraint
to consider when creating the affinity masks. This allows the caller to
exclude specific CPUs, e.g., isolated CPUs (see the 'isolcpus' kernel
command-line parameter).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/interrupt.h | 16 ++++++++++------
 kernel/irq/affinity.c     | 12 ++++++++++--
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 51b6484c049345c75816c4a63b4efa813f42f27b..b1a230953514da57e30e601727cd0e94796153d3 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -284,18 +284,22 @@ struct irq_affinity_notify {
  * @nr_sets:		The number of interrupt sets for which affinity
  *			spreading is required
  * @set_size:		Array holding the size of each interrupt set
+ * @mask:		cpumask that constrains which CPUs to consider when
+ *			calculating the number and size of the interrupt sets
  * @calc_sets:		Callback for calculating the number and size
  *			of interrupt sets
  * @priv:		Private data for usage by @calc_sets, usually a
  *			pointer to driver/device specific data.
  */
 struct irq_affinity {
-	unsigned int	pre_vectors;
-	unsigned int	post_vectors;
-	unsigned int	nr_sets;
-	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
-	void		(*calc_sets)(struct irq_affinity *, unsigned int nvecs);
-	void		*priv;
+	unsigned int		pre_vectors;
+	unsigned int		post_vectors;
+	unsigned int		nr_sets;
+	unsigned int		set_size[IRQ_AFFINITY_MAX_SETS];
+	const struct cpumask	*mask;
+	void			(*calc_sets)(struct irq_affinity *,
+					     unsigned int nvecs);
+	void			*priv;
 };
 
 /**
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4013e6ad2b2f1cb91de12bb428b3281105f7d23b..c68156f7847a7920103e39124676d06191304ef6 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -70,7 +70,13 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	 */
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int nr_masks, this_vecs = affd->set_size[i];
-		struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks);
+		struct cpumask *result;
+
+		if (affd->mask)
+			result = group_mask_cpus_evenly(this_vecs, affd->mask,
+							&nr_masks);
+		else
+			result = group_cpus_evenly(this_vecs, &nr_masks);
 
 		if (!result) {
 			kfree(masks);
@@ -115,7 +121,9 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
 	if (resv > minvec)
 		return 0;
 
-	if (affd->calc_sets) {
+	if (affd->mask) {
+		set_vecs = cpumask_weight(affd->mask);
+	} else if (affd->calc_sets) {
 		set_vecs = maxvec - resv;
 	} else {
 		cpus_read_lock();

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 04/12] genirq/affinity: Add cpumask to struct irq_affinity
  2025-09-05 14:59 ` [PATCH v8 04/12] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
@ 2025-09-10  8:22   ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-09-10  8:22 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, Mathieu Desnoyers, Aaron Tomlin,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner

On Fri, Sep 05 2025 at 16:59, Daniel Wagner wrote:
> Pass a cpumask to irq_create_affinity_masks as an additional constraint
> to consider when creating the affinity masks. This allows the caller to
> exclude specific CPUs, e.g., isolated CPUs (see the 'isolcpus' kernel
> command-line parameter).
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>  include/linux/interrupt.h | 16 ++++++++++------
>  kernel/irq/affinity.c     | 12 ++++++++++--
>  2 files changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 51b6484c049345c75816c4a63b4efa813f42f27b..b1a230953514da57e30e601727cd0e94796153d3 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -284,18 +284,22 @@ struct irq_affinity_notify {
>   * @nr_sets:		The number of interrupt sets for which affinity
>   *			spreading is required
>   * @set_size:		Array holding the size of each interrupt set
> + * @mask:		cpumask that constrains which CPUs to consider when
> + *			calculating the number and size of the interrupt sets

You surely couldn't come up with a less descriptive name for this
member, right?

> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 4013e6ad2b2f1cb91de12bb428b3281105f7d23b..c68156f7847a7920103e39124676d06191304ef6 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -70,7 +70,13 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
>  	 */
>  	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
>  		unsigned int nr_masks, this_vecs = affd->set_size[i];
> -		struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks);
> +		struct cpumask *result;
> +
> +		if (affd->mask)
> +			result = group_mask_cpus_evenly(this_vecs, affd->mask,
> +							&nr_masks);

Please get rid of this line break. You have 100 characters.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 05/12] blk-mq: add blk_mq_{online|possible}_queue_affinity
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (3 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 04/12] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 06/12] nvme-pci: use block layer helpers to constrain queue affinity Daniel Wagner
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Introduce blk_mq_{online|possible}_queue_affinity, which returns the
queue-to-CPU mapping constraints defined by the block layer. This allows
other subsystems (e.g., IRQ affinity setup) to respect block layer
requirements.

It is necessary to provide versions for both the online and possible CPU
masks because some drivers want to spread their I/O queues only across
online CPUs, while others prefer to use all possible CPUs. And the mask
used needs to match with the number of queues requested
(see blk_num_{online|possible}_queues).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c  | 24 ++++++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 705da074ad6c7e88042296f21b739c6d686a72b6..8244ecf878358c0b8de84458dcd5100c2f360213 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -26,6 +26,30 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
 	return min_not_zero(num, max_queues);
 }
 
+/**
+ * blk_mq_possible_queue_affinity - Return block layer queue affinity
+ *
+ * Returns an affinity mask that represents the queue-to-CPU mapping
+ * requested by the block layer based on possible CPUs.
+ */
+const struct cpumask *blk_mq_possible_queue_affinity(void)
+{
+	return cpu_possible_mask;
+}
+EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
+
+/**
+ * blk_mq_online_queue_affinity - Return block layer queue affinity
+ *
+ * Returns an affinity mask that represents the queue-to-CPU mapping
+ * requested by the block layer based on online CPUs.
+ */
+const struct cpumask *blk_mq_online_queue_affinity(void)
+{
+	return cpu_online_mask;
+}
+EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
+
 /**
  * blk_mq_num_possible_queues - Calc nr of queues for multiqueue devices
  * @max_queues:	The maximum number of queues the hardware/driver
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2a5a828f19a0ba6ff0812daf40eed67f0e12ada1..1144017dce47af82f9d010e42bfbf26fa4ddf33f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -947,6 +947,8 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 void blk_mq_unfreeze_queue_non_owner(struct request_queue *q);
 void blk_freeze_queue_start_non_owner(struct request_queue *q);
 
+const struct cpumask *blk_mq_possible_queue_affinity(void);
+const struct cpumask *blk_mq_online_queue_affinity(void);
 unsigned int blk_mq_num_possible_queues(unsigned int max_queues);
 unsigned int blk_mq_num_online_queues(unsigned int max_queues);
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 06/12] nvme-pci: use block layer helpers to constrain queue affinity
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (4 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 05/12] blk-mq: add blk_mq_{online|possible}_queue_affinity Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 07/12] scsi: Use " Daniel Wagner
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the NVMe driver
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2c6d9506b172509fb35716eba456c375f52f5b86..1d9c13aeddb12fa39eef68b7288d1f13eb98a0d7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2604,6 +2604,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 		.pre_vectors	= 1,
 		.calc_sets	= nvme_calc_irq_sets,
 		.priv		= dev,
+		.mask		= blk_mq_possible_queue_affinity(),
 	};
 	unsigned int irq_queues, poll_queues;
 	unsigned int flags = PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY;

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 07/12] scsi: Use block layer helpers to constrain queue affinity
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (5 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 06/12] nvme-pci: use block layer helpers to constrain queue affinity Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-08  6:08   ` Hannes Reinecke
  2025-09-05 14:59 ` [PATCH v8 08/12] virtio: blk/scsi: use " Daniel Wagner
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the SCSI drivers
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Only convert drivers which are already using the
pci_alloc_irq_vectors_affinity with the PCI_IRQ_AFFINITY flag set.
Because these drivers are enabled to let the IRQ core code to
set the affinity. Also don't update qla2xxx because the nvme-fabrics
code is not ready yet.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 1 +
 drivers/scsi/megaraid/megaraid_sas_base.c | 5 ++++-
 drivers/scsi/mpi3mr/mpi3mr_fw.c           | 6 +++++-
 drivers/scsi/mpt3sas/mpt3sas_base.c       | 5 ++++-
 drivers/scsi/pm8001/pm8001_init.c         | 1 +
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 2f3d61abab3a66bf0b40a27b9411dc2cab1c44fc..9f3194ac9c0fb53d619e3a108935ef109308d131 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2607,6 +2607,7 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 	struct pci_dev *pdev = hisi_hba->pci_dev;
 	struct irq_affinity desc = {
 		.pre_vectors = BASE_VECTORS_V3_HW,
+		.mask = blk_mq_online_queue_affinity(),
 	};
 
 	min_msi = MIN_AFFINE_VECTORS_V3_HW;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 615e06fd4ee8e5d1c14ef912460962eacb450c04..c8df2dc47689a5dad02e1364de1d71e24f6159d0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5927,7 +5927,10 @@ static int
 __megasas_alloc_irq_vectors(struct megasas_instance *instance)
 {
 	int i, irq_flags;
-	struct irq_affinity desc = { .pre_vectors = instance->low_latency_index_start };
+	struct irq_affinity desc = {
+		.pre_vectors = instance->low_latency_index_start,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 	struct irq_affinity *descp = &desc;
 
 	irq_flags = PCI_IRQ_MSIX;
diff --git a/drivers/scsi/mpi3mr/mpi3mr_fw.c b/drivers/scsi/mpi3mr/mpi3mr_fw.c
index 0152d31d430abd17ab6b71f248435d9c7c417269..a8fbc84e0ab2ed7ca68a3b874ecfa78a8ebf0c47 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_fw.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_fw.c
@@ -825,7 +825,11 @@ static int mpi3mr_setup_isr(struct mpi3mr_ioc *mrioc, u8 setup_one)
 	int max_vectors, min_vec;
 	int retval;
 	int i;
-	struct irq_affinity desc = { .pre_vectors =  1, .post_vectors = 1 };
+	struct irq_affinity desc = {
+		.pre_vectors =  1,
+		.post_vectors = 1,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 
 	if (mrioc->is_intr_info_set)
 		return 0;
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index bd3efa5b46c780d43fae58c12f0bce5057dcda85..a55dd75221a6079a29f6ebee402b3654b94411c1 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3364,7 +3364,10 @@ static int
 _base_alloc_irq_vectors(struct MPT3SAS_ADAPTER *ioc)
 {
 	int i, irq_flags = PCI_IRQ_MSIX;
-	struct irq_affinity desc = { .pre_vectors = ioc->high_iops_queues };
+	struct irq_affinity desc = {
+		.pre_vectors = ioc->high_iops_queues,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 	struct irq_affinity *descp = &desc;
 	/*
 	 * Don't allocate msix vectors for poll_queues.
diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index 599410bcdfea59aba40e3dd6749434b7b5966d48..1d4807eeed75acdfe091a3c0560a926ebb59e1e8 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -977,6 +977,7 @@ static u32 pm8001_setup_msix(struct pm8001_hba_info *pm8001_ha)
 		 */
 		struct irq_affinity desc = {
 			.pre_vectors = 1,
+			.mask = blk_mq_online_queue_affinity(),
 		};
 		rc = pci_alloc_irq_vectors_affinity(
 				pm8001_ha->pdev, 2, PM8001_MAX_MSIX_VEC,

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 07/12] scsi: Use block layer helpers to constrain queue affinity
  2025-09-05 14:59 ` [PATCH v8 07/12] scsi: Use " Daniel Wagner
@ 2025-09-08  6:08   ` Hannes Reinecke
  0 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2025-09-08  6:08 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 9/5/25 16:59, Daniel Wagner wrote:
> Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
> constraints provided by the block layer. This allows the SCSI drivers
> to avoid assigning interrupts to CPUs that the block layer has excluded
> (e.g., isolated CPUs).
> 
> Only convert drivers which are already using the
> pci_alloc_irq_vectors_affinity with the PCI_IRQ_AFFINITY flag set.
> Because these drivers are enabled to let the IRQ core code to
> set the affinity. Also don't update qla2xxx because the nvme-fabrics
> code is not ready yet.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 1 +
>   drivers/scsi/megaraid/megaraid_sas_base.c | 5 ++++-
>   drivers/scsi/mpi3mr/mpi3mr_fw.c           | 6 +++++-
>   drivers/scsi/mpt3sas/mpt3sas_base.c       | 5 ++++-
>   drivers/scsi/pm8001/pm8001_init.c         | 1 +
>   5 files changed, 15 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 08/12] virtio: blk/scsi: use block layer helpers to constrain queue affinity
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (6 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 07/12] scsi: Use " Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 09/12] isolation: Introduce io_queue isolcpus type Daniel Wagner
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the virtio drivers
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/block/virtio_blk.c | 4 +++-
 drivers/scsi/virtio_scsi.c | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e649fa67bac16b4f0c6e8e8f0e6bec111897c355..41b06540c7fb22fd1d2708338c514947c4bdeefe 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -963,7 +963,9 @@ static int init_vq(struct virtio_blk *vblk)
 	unsigned short num_vqs;
 	unsigned short num_poll_vqs;
 	struct virtio_device *vdev = vblk->vdev;
-	struct irq_affinity desc = { 0, };
+	struct irq_affinity desc = {
+		.mask = blk_mq_possible_queue_affinity(),
+	};
 
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
 				   struct virtio_blk_config, num_queues,
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 96a69edddbe5555574fc8fed1ba7c82a99df4472..67dfb265bf9e54adc68978ac8d93187e6629c330 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -842,7 +842,10 @@ static int virtscsi_init(struct virtio_device *vdev,
 	u32 num_vqs, num_poll_vqs, num_req_vqs;
 	struct virtqueue_info *vqs_info;
 	struct virtqueue **vqs;
-	struct irq_affinity desc = { .pre_vectors = 2 };
+	struct irq_affinity desc = {
+		.pre_vectors = 2,
+		.mask = blk_mq_possible_queue_affinity(),
+	};
 
 	num_req_vqs = vscsi->num_queues;
 	num_vqs = num_req_vqs + VIRTIO_SCSI_VQ_BASE;

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 09/12] isolation: Introduce io_queue isolcpus type
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (7 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 08/12] virtio: blk/scsi: use " Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Multiqueue drivers spread I/O queues across all CPUs for optimal
performance. However, these drivers are not aware of CPU isolation
requirements and will distribute queues without considering the isolcpus
configuration.

Introduce a new isolcpus mask that allows users to define which CPUs
should have I/O queues assigned. This is similar to managed_irq, but
intended for drivers that do not use the managed IRQ infrastructure

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/sched/isolation.h | 1 +
 kernel/sched/isolation.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b583b8a1c91574446382f093bccdb1..6b6ae9c5b2f61a93c649a98ea27482b932627fca 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -9,6 +9,7 @@
 enum hk_type {
 	HK_TYPE_DOMAIN,
 	HK_TYPE_MANAGED_IRQ,
+	HK_TYPE_IO_QUEUE,
 	HK_TYPE_KERNEL_NOISE,
 	HK_TYPE_MAX,
 
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a4cf17b1fab062f536c7f4f47c35f0e209fd25d6..0d59cc95bf3b8fa2f06cb562ce1baf3fdd48c9db 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -13,6 +13,7 @@
 enum hk_flags {
 	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
+	HK_FLAG_IO_QUEUE	= BIT(HK_TYPE_IO_QUEUE),
 	HK_FLAG_KERNEL_NOISE	= BIT(HK_TYPE_KERNEL_NOISE),
 };
 
@@ -226,6 +227,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "io_queue,", 9)) {
+			str += 9;
+			flags |= HK_FLAG_IO_QUEUE;
+			continue;
+		}
+
 		/*
 		 * Skip unknown sub-parameter and validate that it is not
 		 * containing an invalid character.

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (8 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 09/12] isolation: Introduce io_queue isolcpus type Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-08  6:13   ` Hannes Reinecke
                     ` (2 more replies)
  2025-09-05 14:59 ` [PATCH v8 11/12] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
                   ` (2 subsequent siblings)
  12 siblings, 3 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

Extend the capabilities of the generic CPU to hardware queue (hctx)
mapping code, so it maps houskeeping CPUs and isolated CPUs to the
hardware queues evenly.

A hctx is only operational when there is at least one online
housekeeping CPU assigned (aka active_hctx). Thus, check the final
mapping that there is no hctx which has only offline housekeeing CPU and
online isolated CPUs.

Example mapping result:

  16 online CPUs

  isolcpus=io_queue,2-3,6-7,12-13

Queue mapping:
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7
        hctx4: default 8 12
        hctx5: default 9 13
        hctx6: default 10
        hctx7: default 11
        hctx8: default 14
        hctx9: default 15

IRQ mapping:
        irq 42 affinity 0 effective 0  nvme0q0
        irq 43 affinity 0 effective 0  nvme0q1
        irq 44 affinity 1 effective 1  nvme0q2
        irq 45 affinity 4 effective 4  nvme0q3
        irq 46 affinity 5 effective 5  nvme0q4
        irq 47 affinity 8 effective 8  nvme0q5
        irq 48 affinity 9 effective 9  nvme0q6
        irq 49 affinity 10 effective 10  nvme0q7
        irq 50 affinity 11 effective 11  nvme0q8
        irq 51 affinity 14 effective 14  nvme0q9
        irq 52 affinity 15 effective 15  nvme0q10

A corner case is when the number of online CPUs and present CPUs
differ and the driver asks for less queues than online CPUs, e.g.

  8 online CPUs, 16 possible CPUs

  isolcpus=io_queue,2-3,6-7,12-13
  virtio_blk.num_request_queues=2

Queue mapping:
        hctx0: default 0 1 2 3 4 5 6 7 8 12 13
        hctx1: default 9 10 11 14 15

IRQ mapping
        irq 27 affinity 0 effective 0 virtio0-config
        irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
        irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1

Noteworthy is that for the normal/default configuration (!isoclpus) the
mapping will change for systems which have non hyperthreading CPUs. The
main assignment loop will completely rely that group_mask_cpus_evenly to
do the right thing. The old code would distribute the CPUs linearly over
the hardware context:

queue mapping for /dev/nvme0n1
        hctx0: default 0 8
        hctx1: default 1 9
        hctx2: default 2 10
        hctx3: default 3 11
        hctx4: default 4 12
        hctx5: default 5 13
        hctx6: default 6 14
        hctx7: default 7 15

The assign each hardware context the map generated by the
group_mask_cpus_evenly function:

queue mapping for /dev/nvme0n1
        hctx0: default 0 1
        hctx1: default 2 3
        hctx2: default 4 5
        hctx3: default 6 7
        hctx4: default 8 9
        hctx5: default 10 11
        hctx6: default 12 13
        hctx7: default 14 15

In case of hyperthreading CPUs, the resulting map stays the same.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c | 177 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 158 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8244ecf878358c0b8de84458dcd5100c2f360213..1e66882e4d5bd9f78d132f3a229a1577853f7a9f 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -17,12 +17,25 @@
 #include "blk.h"
 #include "blk-mq.h"
 
+static struct cpumask blk_hk_online_mask;
+
 static unsigned int blk_mq_num_queues(const struct cpumask *mask,
 				      unsigned int max_queues)
 {
 	unsigned int num;
 
-	num = cpumask_weight(mask);
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		const struct cpumask *hk_mask;
+		struct cpumask avail_mask;
+
+		hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+		cpumask_and(&avail_mask, mask, hk_mask);
+
+		num = cpumask_weight(&avail_mask);
+	} else {
+		num = cpumask_weight(mask);
+	}
+
 	return min_not_zero(num, max_queues);
 }
 
@@ -31,9 +44,13 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
  *
  * Returns an affinity mask that represents the queue-to-CPU mapping
  * requested by the block layer based on possible CPUs.
+ * This helper takes isolcpus settings into account.
  */
 const struct cpumask *blk_mq_possible_queue_affinity(void)
 {
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
 	return cpu_possible_mask;
 }
 EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
@@ -46,6 +63,12 @@ EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
  */
 const struct cpumask *blk_mq_online_queue_affinity(void)
 {
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
+			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
+		return &blk_hk_online_mask;
+	}
+
 	return cpu_online_mask;
 }
 EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
@@ -57,7 +80,8 @@ EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of possible CPUs.
+ * device based on the number of possible CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
 {
@@ -72,7 +96,8 @@ EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of online CPUs.
+ * device based on the number of online CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 {
@@ -80,23 +105,104 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
 
+static bool blk_mq_validate(struct blk_mq_queue_map *qmap,
+			    const struct cpumask *active_hctx)
+{
+	/*
+	 * Verify if the mapping is usable when housekeeping
+	 * configuration is enabled
+	 */
+
+	for (int queue = 0; queue < qmap->nr_queues; queue++) {
+		int cpu;
+
+		if (cpumask_test_cpu(queue, active_hctx)) {
+			/*
+			 * This htcx has at least one online CPU thus it
+			 * is able to serve any assigned isolated CPU.
+			 */
+			continue;
+		}
+
+		/*
+		 * There is no housekeeping online CPU for this hctx, all
+		 * good as long as all non houskeeping CPUs are also
+		 * offline.
+		 */
+		for_each_online_cpu(cpu) {
+			if (qmap->mq_map[cpu] != queue)
+				continue;
+
+			pr_warn("Unable to create a usable CPU-to-queue mapping with the given constraints\n");
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static void blk_mq_map_fallback(struct blk_mq_queue_map *qmap)
+{
+	unsigned int cpu;
+
+	/*
+	 * Map all CPUs to the first hctx to ensure at least one online
+	 * CPU is serving it.
+	 */
+	for_each_possible_cpu(cpu)
+		qmap->mq_map[cpu] = 0;
+}
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
-	const struct cpumask *masks;
+	struct cpumask *masks __free(kfree) = NULL;
+	const struct cpumask *constraint;
 	unsigned int queue, cpu, nr_masks;
+	cpumask_var_t active_hctx;
 
-	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
-	if (!masks) {
-		for_each_possible_cpu(cpu)
-			qmap->mq_map[cpu] = qmap->queue_offset;
-		return;
-	}
+	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
+		goto fallback;
+
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	else
+		constraint = cpu_possible_mask;
+
+	/* Map CPUs to the hardware contexts (hctx) */
+	masks = group_mask_cpus_evenly(qmap->nr_queues, constraint, &nr_masks);
+	if (!masks)
+		goto free_fallback;
 
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		for_each_cpu(cpu, &masks[queue % nr_masks])
-			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
+
+		for_each_cpu(cpu, &masks[idx]) {
+			qmap->mq_map[cpu] = idx;
+
+			if (cpu_online(cpu))
+				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
+		}
 	}
-	kfree(masks);
+
+	/* Map any unassigned CPU evenly to the hardware contexts (hctx) */
+	queue = cpumask_first(active_hctx);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, constraint) {
+		qmap->mq_map[cpu] = (qmap->queue_offset + queue) % nr_masks;
+		queue = cpumask_next_wrap(queue, active_hctx);
+	}
+
+	if (!blk_mq_validate(qmap, active_hctx))
+		goto free_fallback;
+
+	free_cpumask_var(active_hctx);
+
+	return;
+
+free_fallback:
+	free_cpumask_var(active_hctx);
+
+fallback:
+	blk_mq_map_fallback(qmap);
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
 
@@ -133,24 +239,57 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 			  struct device *dev, unsigned int offset)
 
 {
-	const struct cpumask *mask;
+	cpumask_var_t active_hctx, mask;
 	unsigned int queue, cpu;
 
 	if (!dev->bus->irq_get_affinity)
 		goto fallback;
 
+	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
+		goto fallback;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+		free_cpumask_var(active_hctx);
+		goto fallback;
+	}
+
+	/* Map CPUs to the hardware contexts (hctx) */
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		mask = dev->bus->irq_get_affinity(dev, queue + offset);
-		if (!mask)
-			goto fallback;
+		const struct cpumask *affinity_mask;
+
+		affinity_mask = dev->bus->irq_get_affinity(dev, offset + queue);
+		if (!affinity_mask)
+			goto free_fallback;
 
-		for_each_cpu(cpu, mask)
+		for_each_cpu(cpu, affinity_mask) {
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+
+			cpumask_set_cpu(cpu, mask);
+			if (cpu_online(cpu))
+				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
+		}
+	}
+
+	/* Map any unassigned CPU evenly to the hardware contexts (hctx) */
+	queue = cpumask_first(active_hctx);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = cpumask_next_wrap(queue, active_hctx);
 	}
 
+	if (!blk_mq_validate(qmap, active_hctx))
+		goto free_fallback;
+
+	free_cpumask_var(active_hctx);
+	free_cpumask_var(mask);
+
 	return;
 
+free_fallback:
+	free_cpumask_var(active_hctx);
+	free_cpumask_var(mask);
+
 fallback:
-	blk_mq_map_queues(qmap);
+	blk_mq_map_fallback(qmap);
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_hw_queues);

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-05 14:59 ` [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
@ 2025-09-08  6:13   ` Hannes Reinecke
  2025-09-08  7:26     ` Daniel Wagner
  2025-09-08  7:36   ` Daniel Wagner
  2025-09-10  6:05   ` kernel test robot
  2 siblings, 1 reply; 29+ messages in thread
From: Hannes Reinecke @ 2025-09-08  6:13 UTC (permalink / raw)
  To: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Mathieu Desnoyers,
	linux-kernel, linux-block, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On 9/5/25 16:59, Daniel Wagner wrote:
> Extend the capabilities of the generic CPU to hardware queue (hctx)
> mapping code, so it maps houskeeping CPUs and isolated CPUs to the
> hardware queues evenly.
> 
> A hctx is only operational when there is at least one online
> housekeeping CPU assigned (aka active_hctx). Thus, check the final
> mapping that there is no hctx which has only offline housekeeing CPU and
> online isolated CPUs.
> 
> Example mapping result:
> 
>    16 online CPUs
> 
>    isolcpus=io_queue,2-3,6-7,12-13
> 
> Queue mapping:
>          hctx0: default 0 2
>          hctx1: default 1 3
>          hctx2: default 4 6
>          hctx3: default 5 7
>          hctx4: default 8 12
>          hctx5: default 9 13
>          hctx6: default 10
>          hctx7: default 11
>          hctx8: default 14
>          hctx9: default 15
> 
> IRQ mapping:
>          irq 42 affinity 0 effective 0  nvme0q0
>          irq 43 affinity 0 effective 0  nvme0q1
>          irq 44 affinity 1 effective 1  nvme0q2
>          irq 45 affinity 4 effective 4  nvme0q3
>          irq 46 affinity 5 effective 5  nvme0q4
>          irq 47 affinity 8 effective 8  nvme0q5
>          irq 48 affinity 9 effective 9  nvme0q6
>          irq 49 affinity 10 effective 10  nvme0q7
>          irq 50 affinity 11 effective 11  nvme0q8
>          irq 51 affinity 14 effective 14  nvme0q9
>          irq 52 affinity 15 effective 15  nvme0q10
> 
> A corner case is when the number of online CPUs and present CPUs
> differ and the driver asks for less queues than online CPUs, e.g.
> 
>    8 online CPUs, 16 possible CPUs
> 
>    isolcpus=io_queue,2-3,6-7,12-13
>    virtio_blk.num_request_queues=2
> 
> Queue mapping:
>          hctx0: default 0 1 2 3 4 5 6 7 8 12 13
>          hctx1: default 9 10 11 14 15
> 
> IRQ mapping
>          irq 27 affinity 0 effective 0 virtio0-config
>          irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
>          irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1
> 
> Noteworthy is that for the normal/default configuration (!isoclpus) the
> mapping will change for systems which have non hyperthreading CPUs. The
> main assignment loop will completely rely that group_mask_cpus_evenly to
> do the right thing. The old code would distribute the CPUs linearly over
> the hardware context:
> 
> queue mapping for /dev/nvme0n1
>          hctx0: default 0 8
>          hctx1: default 1 9
>          hctx2: default 2 10
>          hctx3: default 3 11
>          hctx4: default 4 12
>          hctx5: default 5 13
>          hctx6: default 6 14
>          hctx7: default 7 15
> 
> The assign each hardware context the map generated by the
> group_mask_cpus_evenly function:
> 
> queue mapping for /dev/nvme0n1
>          hctx0: default 0 1
>          hctx1: default 2 3
>          hctx2: default 4 5
>          hctx3: default 6 7
>          hctx4: default 8 9
>          hctx5: default 10 11
>          hctx6: default 12 13
>          hctx7: default 14 15
> 
> In case of hyperthreading CPUs, the resulting map stays the same.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   block/blk-mq-cpumap.c | 177 ++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 158 insertions(+), 19 deletions(-)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 8244ecf878358c0b8de84458dcd5100c2f360213..1e66882e4d5bd9f78d132f3a229a1577853f7a9f 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -17,12 +17,25 @@
>   #include "blk.h"
>   #include "blk-mq.h"
>   
> +static struct cpumask blk_hk_online_mask;
> +
>   static unsigned int blk_mq_num_queues(const struct cpumask *mask,
>   				      unsigned int max_queues)
>   {
>   	unsigned int num;
>   
> -	num = cpumask_weight(mask);
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		const struct cpumask *hk_mask;
> +		struct cpumask avail_mask;
> +
> +		hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +		cpumask_and(&avail_mask, mask, hk_mask);
> +
> +		num = cpumask_weight(&avail_mask);
> +	} else {
> +		num = cpumask_weight(mask);
> +	}
> +
>   	return min_not_zero(num, max_queues);
>   }
>   
> @@ -31,9 +44,13 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
>    *
>    * Returns an affinity mask that represents the queue-to-CPU mapping
>    * requested by the block layer based on possible CPUs.
> + * This helper takes isolcpus settings into account.
>    */
>   const struct cpumask *blk_mq_possible_queue_affinity(void)
>   {
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
> +		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +
>   	return cpu_possible_mask;
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
> @@ -46,6 +63,12 @@ EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
>    */
>   const struct cpumask *blk_mq_online_queue_affinity(void)
>   {
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
> +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> +		return &blk_hk_online_mask;

Can you explain the use of 'blk_hk_online_mask'?
Why is a static variable?
To my untrained eye it's being recalculated every time one calls
this function. And only the first invocation run on an empty mask,
all subsequent ones see a populated mask.

Is that the intention?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-08  6:13   ` Hannes Reinecke
@ 2025-09-08  7:26     ` Daniel Wagner
  2025-09-08  7:51       ` Hannes Reinecke
  2025-09-10  8:20       ` Thomas Gleixner
  0 siblings, 2 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-08  7:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Mon, Sep 08, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote:
> >   const struct cpumask *blk_mq_online_queue_affinity(void)
> >   {
> > +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> > +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
> > +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> > +		return &blk_hk_online_mask;
> 
> Can you explain the use of 'blk_hk_online_mask'?
> Why is a static variable?

The blk_mq_*_queue_affinity helpers return a const struct cpumask *, the
caller doesn't need to free the return value. Because cpumask_and needs
store its result somewhere, I opted for the global static variable.

> To my untrained eye it's being recalculated every time one calls
> this function. And only the first invocation run on an empty mask,
> all subsequent ones see a populated mask.

The cpu_online_mask might change over time, it's not a static bitmap.
Thus it's necessary to update the blk_hk_online_mask. Doing some sort of
caching is certainly possible. Given that we have plenty of cpumask
logic operation in the cpu_group_evenly code path later, I am not so
sure this really makes a huge difference.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-08  7:26     ` Daniel Wagner
@ 2025-09-08  7:51       ` Hannes Reinecke
  2025-09-08  8:08         ` Daniel Wagner
  2025-09-10  8:20       ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Hannes Reinecke @ 2025-09-08  7:51 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On 9/8/25 09:26, Daniel Wagner wrote:
> On Mon, Sep 08, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote:
>>>    const struct cpumask *blk_mq_online_queue_affinity(void)
>>>    {
>>> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
>>> +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
>>> +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
>>> +		return &blk_hk_online_mask;
>>
>> Can you explain the use of 'blk_hk_online_mask'?
>> Why is a static variable?
> 
> The blk_mq_*_queue_affinity helpers return a const struct cpumask *, the
> caller doesn't need to free the return value. Because cpumask_and needs
> store its result somewhere, I opted for the global static variable.
> 
>> To my untrained eye it's being recalculated every time one calls
>> this function. And only the first invocation run on an empty mask,
>> all subsequent ones see a populated mask.
> 
> The cpu_online_mask might change over time, it's not a static bitmap.
> Thus it's necessary to update the blk_hk_online_mask. Doing some sort of
> caching is certainly possible. Given that we have plenty of cpumask
> logic operation in the cpu_group_evenly code path later, I am not so
> sure this really makes a huge difference.

Oh, that's okay. I'm perfectly fine to update the cpumask on every call.
What makes me wonder is the initialisation of blk_hk_online_mask;
it's zeroed _at boot_, and then all we do is calling cpumask_and()
for every invocation. So the mask will only increase in scope,
and never decrease.

Wouldn't it be better to call 'cpumask_zero' before 'cpumask_and'?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-08  7:51       ` Hannes Reinecke
@ 2025-09-08  8:08         ` Daniel Wagner
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-08  8:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Thomas Gleixner, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Mon, Sep 08, 2025 at 09:51:34AM +0200, Hannes Reinecke wrote:
> Wouldn't it be better to call 'cpumask_zero' before 'cpumask_and'?

I don't think this is necessary, from the docs :

  cpumask_and - *dstp = *src1p & *src2p

cpumask_and call bitmap_and which is:

static __always_inline
bool bitmap_and(unsigned long *dst, const unsigned long *src1,
		const unsigned long *src2, unsigned int nbits)
{
	if (small_const_nbits(nbits))
		return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
	return __bitmap_and(dst, src1, src2, nbits);
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-08  7:26     ` Daniel Wagner
  2025-09-08  7:51       ` Hannes Reinecke
@ 2025-09-10  8:20       ` Thomas Gleixner
  2025-09-12  8:32         ` Daniel Wagner
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-09-10  8:20 UTC (permalink / raw)
  To: Daniel Wagner, Hannes Reinecke
  Cc: Daniel Wagner, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Michael S. Tsirkin, Aaron Tomlin,
	Martin K. Petersen, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Mon, Sep 08 2025 at 09:26, Daniel Wagner wrote:
> On Mon, Sep 08, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote:
>> >   const struct cpumask *blk_mq_online_queue_affinity(void)
>> >   {
>> > +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
>> > +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
>> > +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
>> > +		return &blk_hk_online_mask;
>> 
>> Can you explain the use of 'blk_hk_online_mask'?
>> Why is a static variable?
>
> The blk_mq_*_queue_affinity helpers return a const struct cpumask *, the
> caller doesn't need to free the return value. Because cpumask_and needs
> store its result somewhere, I opted for the global static variable.
>
>> To my untrained eye it's being recalculated every time one calls
>> this function. And only the first invocation run on an empty mask,
>> all subsequent ones see a populated mask.
>
> The cpu_online_mask might change over time, it's not a static bitmap.
> Thus it's necessary to update the blk_hk_online_mask. Doing some sort of
> caching is certainly possible. Given that we have plenty of cpumask
> logic operation in the cpu_group_evenly code path later, I am not so
> sure this really makes a huge difference.

Sure,  but none of this is serialized against CPU hotplug operations. So
the resulting mask, which is handed into the spreading code can be
concurrently modified. IOW it's not as const as the code claims.

How is this even remotely correct?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-10  8:20       ` Thomas Gleixner
@ 2025-09-12  8:32         ` Daniel Wagner
  2025-09-12 14:31           ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-09-12  8:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Hannes Reinecke, Daniel Wagner, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Michael S. Tsirkin,
	Aaron Tomlin, Martin K. Petersen, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Wed, Sep 10, 2025 at 10:20:26AM +0200, Thomas Gleixner wrote:
> On Mon, Sep 08 2025 at 09:26, Daniel Wagner wrote:
> > On Mon, Sep 08, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote:
> >> >   const struct cpumask *blk_mq_online_queue_affinity(void)
> >> >   {
> >> > +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> >> > +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
> >> > +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> >> > +		return &blk_hk_online_mask;
> >> 
> >> Can you explain the use of 'blk_hk_online_mask'?
> >> Why is a static variable?
> >
> > The blk_mq_*_queue_affinity helpers return a const struct cpumask *, the
> > caller doesn't need to free the return value. Because cpumask_and needs
> > store its result somewhere, I opted for the global static variable.
> >
> >> To my untrained eye it's being recalculated every time one calls
> >> this function. And only the first invocation run on an empty mask,
> >> all subsequent ones see a populated mask.
> >
> > The cpu_online_mask might change over time, it's not a static bitmap.
> > Thus it's necessary to update the blk_hk_online_mask. Doing some sort of
> > caching is certainly possible. Given that we have plenty of cpumask
> > logic operation in the cpu_group_evenly code path later, I am not so
> > sure this really makes a huge difference.
> 
> Sure,  but none of this is serialized against CPU hotplug operations. So
> the resulting mask, which is handed into the spreading code can be
> concurrently modified. IOW it's not as const as the code claims.

Thanks for explaining.

In group_cpu_evenly:

	/*
	 * Make a local cache of 'cpu_present_mask', so the two stages
	 * spread can observe consistent 'cpu_present_mask' without holding
	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
	 * hotplug code.
	 *
	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
	 * we can live with the case because it only affects that hotplug
	 * CPU is handled in the 1st or 2nd stage, and either way is correct
	 * from API user viewpoint since 2-stage spread is sort of
	 * optimization.
	 */
	cpumask_copy(npresmsk, data_race(cpu_present_mask));


0263f92fadbb ("lib/group_cpus.c: avoid acquiring cpu hotplug lock in
group_cpus_evenly"):

  group_cpus_evenly() could be part of storage driver's error handler, such
  as nvme driver, when may happen during CPU hotplug, in which storage queue
  has to drain its pending IOs because all CPUs associated with the queue
  are offline and the queue is becoming inactive.  And handling IO needs
  error handler to provide forward progress.

  Then deadlock is caused:

  1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
     handler is waiting for inflight IO

  2) error handler is waiting for CPU hotplug lock

  3) inflight IO can't be completed in blk-mq's CPU hotplug handler
     because error handling can't provide forward progress.

  Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
  in which two stage spreads are taken: 1) the 1st stage is over all present
  CPUs; 2) the end stage is over all other CPUs.

  Turns out the two stage spread just needs consistent 'cpu_present_mask',
  and remove the CPU hotplug lock by storing it into one local cache.  This
  way doesn't change correctness, because all CPUs are still covered.

This sounds like I should do something similar with cpu_online_mask.
Anyway, I'll work on this.

> How is this even remotely correct?

It isn't :( I did hotplug tests but obviously these were not really up
to the task. The kernel test bot gave me a pointer how I should test.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-12  8:32         ` Daniel Wagner
@ 2025-09-12 14:31           ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-09-12 14:31 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Hannes Reinecke, Daniel Wagner, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Michael S. Tsirkin,
	Aaron Tomlin, Martin K. Petersen, Costa Shulyupin, Juri Lelli,
	Valentin Schneider, Waiman Long, Ming Lei, Frederic Weisbecker,
	Mel Gorman, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On Fri, Sep 12 2025 at 10:32, Daniel Wagner wrote:
> On Wed, Sep 10, 2025 at 10:20:26AM +0200, Thomas Gleixner wrote:
>> > The cpu_online_mask might change over time, it's not a static bitmap.
>> > Thus it's necessary to update the blk_hk_online_mask. Doing some sort of
>> > caching is certainly possible. Given that we have plenty of cpumask
>> > logic operation in the cpu_group_evenly code path later, I am not so
>> > sure this really makes a huge difference.
>> 
>> Sure,  but none of this is serialized against CPU hotplug operations. So
>> the resulting mask, which is handed into the spreading code can be
>> concurrently modified. IOW it's not as const as the code claims.
>
> Thanks for explaining.
>
> In group_cpu_evenly:
>
> 	/*
> 	 * Make a local cache of 'cpu_present_mask', so the two stages
> 	 * spread can observe consistent 'cpu_present_mask' without holding
> 	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
> 	 * hotplug code.
> 	 *
> 	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
> 	 * we can live with the case because it only affects that hotplug
> 	 * CPU is handled in the 1st or 2nd stage, and either way is correct
> 	 * from API user viewpoint since 2-stage spread is sort of
> 	 * optimization.
> 	 */
> 	cpumask_copy(npresmsk, data_race(cpu_present_mask));

The present mask is very different from the online mask. The present
mask only changes on physical hotplug when:

     - a offline CPU is removed from the present set of CPUs

     - a offline CPU is added to it.

In neither case the CPU can be involved in any operation related to the
actual offline/online operations.

Also contrary to your approach, this code takes the possibility of
a concurrently changing mask into account by taking a racy snapshot,
which is immutable for the following operation.

What you are doing with that static mask, makes it a target of
concurrent modification, which is obviously a recipe for subtle bugs.

>   Turns out the two stage spread just needs consistent 'cpu_present_mask',
>   and remove the CPU hotplug lock by storing it into one local cache.  This
>   way doesn't change correctness, because all CPUs are still covered.
>
> This sounds like I should do something similar with cpu_online_mask.

Indeed.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-05 14:59 ` [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
  2025-09-08  6:13   ` Hannes Reinecke
@ 2025-09-08  7:36   ` Daniel Wagner
  2025-09-08 13:05     ` Daniel Wagner
  2025-09-10  6:05   ` kernel test robot
  2 siblings, 1 reply; 29+ messages in thread
From: Daniel Wagner @ 2025-09-08  7:36 UTC (permalink / raw)
  To: linux-nvme
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Mathieu Desnoyers, linux-kernel, linux-block,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On Fri, Sep 05, 2025 at 04:59:56PM +0200, Daniel Wagner wrote:
>  void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
>  {
> -	const struct cpumask *masks;
> +	struct cpumask *masks __free(kfree) = NULL;
> +	const struct cpumask *constraint;
>  	unsigned int queue, cpu, nr_masks;
> +	cpumask_var_t active_hctx;
>  
> +	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
> +		goto fallback;
> +

> +	free_cpumask_var(active_hctx);
> +
> +	return;
> +
> +free_fallback:
> +	free_cpumask_var(active_hctx);
> +
> +fallback:
> +	blk_mq_map_fallback(qmap);

I am not so happy that the cpumask_var_t and __free doesn't work to
together at this point due to the 'evil way' how cpumask_var_t is defined:

	ifdef CONFIG_CPUMASK_OFFSTACK
	typedef struct cpumask *cpumask_var_t;
	#else
	typedef struct cpumask cpumask_var_t[1];
	#endif /* CONFIG_CPUMASK_OFFSTACK */

In the previous version I used

	cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;

which resulted in a way cleaner code. Though the kernel test robot
complained with

      >> block/blk-mq-cpumap.c:155:16: error: array initializer must be an initializer list
           155 |         cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;

I try to figure out if it's possible to get this somehow working with
some witchcraft (aka pre compiler magic).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-08  7:36   ` Daniel Wagner
@ 2025-09-08 13:05     ` Daniel Wagner
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-08 13:05 UTC (permalink / raw)
  To: linux-nvme
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Mathieu Desnoyers, linux-kernel, linux-block,
	megaraidlinux.pdl, linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream

On Mon, Sep 08, 2025 at 09:36:35AM +0200, Daniel Wagner wrote:
> which resulted in a way cleaner code. Though the kernel test robot
> complained with
> 
>       >> block/blk-mq-cpumap.c:155:16: error: array initializer must be an initializer list
>            155 |         cpumask_var_t active_hctx __free(free_cpumask_var) = NULL;
> 
> I try to figure out if it's possible to get this somehow working with
> some witchcraft (aka pre compiler magic).

What about adding something like this here:


#ifdef CONFIG_CPUMASK_OFFSTACK

#define scoped_cpumask_var(_name)			\
	cpumask_var_t _name __free(free_cpumask_var) = NULL;

#else /* ! CONFIG_CPUMASK_OFFSTACK */

#define scoped_cpumask_var(_name)			\
	cpumask_var_t _name __free(free_cpumask_var);

#endif /* ! CONFIG_CPUMASK_OFFSTACK */


This would make the new code way cleaner.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2025-09-05 14:59 ` [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
  2025-09-08  6:13   ` Hannes Reinecke
  2025-09-08  7:36   ` Daniel Wagner
@ 2025-09-10  6:05   ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2025-09-10  6:05 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: oe-lkp, lkp, linux-block, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Michael S. Tsirkin,
	Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, linux-kernel, linux-nvme, megaraidlinux.pdl,
	linux-scsi, storagedev, virtualization,
	GR-QLogic-Storage-Upstream, Daniel Wagner, oliver.sang



Hello,

kernel test robot noticed "BUG:kernel_NULL_pointer_dereference,address" on:

commit: d918b4998cfeebf2116443c533f7e3e593658465 ("[PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled")
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Wagner/scsi-aacraid-use-block-layer-helpers-to-calculate-num-of-queues/20250905-230949
patch link: https://lore.kernel.org/all/20250905-isolcpus-io-queues-v8-10-885984c5daca@kernel.org/
patch subject: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled

in testcase: rcutorture
version: 
with following parameters:

	runtime: 300s
	test: cpuhotplug
	torture_type: tasks-rude



config: i386-randconfig-017-20250909
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+---------------------------------------------+------------+------------+
|                                             | 0365b94791 | d918b4998c |
+---------------------------------------------+------------+------------+
| boot_successes                              | 12         | 0          |
| boot_failures                               | 0          | 15         |
| Mem-Info                                    | 0          | 15         |
| BUG:kernel_NULL_pointer_dereference,address | 0          | 15         |
| Oops                                        | 0          | 15         |
| EIP:__blk_mq_all_tag_iter                   | 0          | 15         |
| Kernel_panic-not_syncing:Fatal_exception    | 0          | 15         |
+---------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202509101342.a803ecaa-lkp@intel.com


[  874.700557][   T21] BUG: kernel NULL pointer dereference, address: 00000004
[  874.701560][   T21] #PF: supervisor read access in kernel mode
[  874.702264][   T21] #PF: error_code(0x0000) - not-present page
[  874.702940][   T21] *pde = 00000000
[  874.703513][   T21] Oops: Oops: 0000 [#1] SMP
[  874.704091][   T21] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Tainted: G S                  6.17.0-rc4-00010-gd918b4998cfe #1 NONE
[  874.705003][   T21] Tainted: [S]=CPU_OUT_OF_SPEC
[  874.705657][   T21] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 874.706497][ T21] EIP: __blk_mq_all_tag_iter (block/blk-mq-tag.c:399) 
[ 874.707121][ T21] Code: c9 6a 00 e8 d8 4f 94 ff 83 c4 04 89 da 83 e2 01 74 02 0f 0b 8b 5d 08 b8 30 7c 33 45 31 c9 6a 00 e8 bb 4f 94 ff 89 d9 83 c4 04 <83> 7e 04 00 8b 5d 0c 74 2e 89 d8 83 c8 01 89 75 e4 89 7d e8 89 4d
All code
========
   0:	c9                   	leave
   1:	6a 00                	push   $0x0
   3:	e8 d8 4f 94 ff       	call   0xffffffffff944fe0
   8:	83 c4 04             	add    $0x4,%esp
   b:	89 da                	mov    %ebx,%edx
   d:	83 e2 01             	and    $0x1,%edx
  10:	74 02                	je     0x14
  12:	0f 0b                	ud2
  14:	8b 5d 08             	mov    0x8(%rbp),%ebx
  17:	b8 30 7c 33 45       	mov    $0x45337c30,%eax
  1c:	31 c9                	xor    %ecx,%ecx
  1e:	6a 00                	push   $0x0
  20:	e8 bb 4f 94 ff       	call   0xffffffffff944fe0
  25:	89 d9                	mov    %ebx,%ecx
  27:	83 c4 04             	add    $0x4,%esp
  2a:*	83 7e 04 00          	cmpl   $0x0,0x4(%rsi)		<-- trapping instruction
  2e:	8b 5d 0c             	mov    0xc(%rbp),%ebx
  31:	74 2e                	je     0x61
  33:	89 d8                	mov    %ebx,%eax
  35:	83 c8 01             	or     $0x1,%eax
  38:	89 75 e4             	mov    %esi,-0x1c(%rbp)
  3b:	89 7d e8             	mov    %edi,-0x18(%rbp)
  3e:	89                   	.byte 0x89
  3f:	4d                   	rex.WRB

Code starting with the faulting instruction
===========================================
   0:	83 7e 04 00          	cmpl   $0x0,0x4(%rsi)
   4:	8b 5d 0c             	mov    0xc(%rbp),%ebx
   7:	74 2e                	je     0x37
   9:	89 d8                	mov    %ebx,%eax
   b:	83 c8 01             	or     $0x1,%eax
   e:	89 75 e4             	mov    %esi,-0x1c(%rbp)
  11:	89 7d e8             	mov    %edi,-0x18(%rbp)
  14:	89                   	.byte 0x89
  15:	4d                   	rex.WRB
[  874.708716][   T21] EAX: 00000000 EBX: 4632deb8 ECX: 4632deb8 EDX: 00000000
[  874.709385][   T21] ESI: 00000000 EDI: 4192ace0 EBP: 4632de9c ESP: 4632de80
[  874.710046][   T21] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010212
[  874.710741][   T21] CR0: 80050033 CR2: 00000004 CR3: 158ad000 CR4: 00040690
[  874.711424][   T21] Call Trace:
[ 874.711911][ T21] ? blk_mq_all_tag_iter (block/blk-mq-tag.c:420) 
[ 874.712479][ T21] ? blk_mq_hctx_notify_offline (block/blk-mq.c:3736) 
[ 874.713083][ T21] ? blk_mq_hctx_notify_online (block/blk-mq.c:3713) 
[ 874.713672][ T21] ? cpuhp_invoke_callback (kernel/cpu.c:217) 
[ 874.714273][ T21] ? blk_mq_hctx_notify_online (block/blk-mq.c:3713) 
[ 874.714861][ T21] ? cpuhp_thread_fun (kernel/cpu.c:1105) 
[ 874.715433][ T21] ? smpboot_thread_fn (kernel/smpboot.c:?) 
[ 874.716005][ T21] ? kthread (kernel/kthread.c:465) 
[ 874.716528][ T21] ? smpboot_unregister_percpu_thread (kernel/smpboot.c:103) 
[ 874.717144][ T21] ? __do_trace_sched_kthread_stop_ret (kernel/kthread.c:412) 
[ 874.717763][ T21] ? __do_trace_sched_kthread_stop_ret (kernel/kthread.c:412) 
[ 874.718378][ T21] ? ret_from_fork (arch/x86/kernel/process.c:154) 
[ 874.718945][ T21] ? __do_trace_sched_kthread_stop_ret (kernel/kthread.c:412) 
[ 874.719574][ T21] ? ret_from_fork_asm (arch/x86/entry/entry_32.S:737) 
[ 874.720128][ T21] ? entry_INT80_32 (arch/x86/entry/entry_32.S:945) 
[  874.720667][   T21] Modules linked in: rcutorture torture
[  874.721260][   T21] CR2: 0000000000000004
[  874.721773][   T21] ---[ end trace 0000000000000000 ]---
[ 874.722424][ T21] EIP: __blk_mq_all_tag_iter (block/blk-mq-tag.c:399) 
[ 874.723094][ T21] Code: c9 6a 00 e8 d8 4f 94 ff 83 c4 04 89 da 83 e2 01 74 02 0f 0b 8b 5d 08 b8 30 7c 33 45 31 c9 6a 00 e8 bb 4f 94 ff 89 d9 83 c4 04 <83> 7e 04 00 8b 5d 0c 74 2e 89 d8 83 c8 01 89 75 e4 89 7d e8 89 4d
All code
========
   0:	c9                   	leave
   1:	6a 00                	push   $0x0
   3:	e8 d8 4f 94 ff       	call   0xffffffffff944fe0
   8:	83 c4 04             	add    $0x4,%esp
   b:	89 da                	mov    %ebx,%edx
   d:	83 e2 01             	and    $0x1,%edx
  10:	74 02                	je     0x14
  12:	0f 0b                	ud2
  14:	8b 5d 08             	mov    0x8(%rbp),%ebx
  17:	b8 30 7c 33 45       	mov    $0x45337c30,%eax
  1c:	31 c9                	xor    %ecx,%ecx
  1e:	6a 00                	push   $0x0
  20:	e8 bb 4f 94 ff       	call   0xffffffffff944fe0
  25:	89 d9                	mov    %ebx,%ecx
  27:	83 c4 04             	add    $0x4,%esp
  2a:*	83 7e 04 00          	cmpl   $0x0,0x4(%rsi)		<-- trapping instruction
  2e:	8b 5d 0c             	mov    0xc(%rbp),%ebx
  31:	74 2e                	je     0x61
  33:	89 d8                	mov    %ebx,%eax
  35:	83 c8 01             	or     $0x1,%eax
  38:	89 75 e4             	mov    %esi,-0x1c(%rbp)
  3b:	89 7d e8             	mov    %edi,-0x18(%rbp)
  3e:	89                   	.byte 0x89
  3f:	4d                   	rex.WRB

Code starting with the faulting instruction
===========================================
   0:	83 7e 04 00          	cmpl   $0x0,0x4(%rsi)
   4:	8b 5d 0c             	mov    0xc(%rbp),%ebx
   7:	74 2e                	je     0x37
   9:	89 d8                	mov    %ebx,%eax
   b:	83 c8 01             	or     $0x1,%eax
   e:	89 75 e4             	mov    %esi,-0x1c(%rbp)
  11:	89 7d e8             	mov    %edi,-0x18(%rbp)
  14:	89                   	.byte 0x89
  15:	4d                   	rex.WRB


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250910/202509101342.a803ecaa-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 11/12] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (9 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2025-09-05 14:59 ` [PATCH v8 12/12] docs: add io_queue flag to isolcpus Daniel Wagner
  2026-03-25 17:56 ` [PATCH v8 00/12] blk: honor isolcpus configuration Sebastian Andrzej Siewior
  12 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

When isolcpus=io_queue is enabled, and the last housekeeping CPU for a
given hctx goes offline, there would be no CPU left to handle I/O. To
prevent I/O stalls, prevent offlining housekeeping CPUs that are still
serving isolated CPUs.

When isolcpus=io_queue is enabled and the last housekeeping CPU
for a given hctx goes offline, no CPU would be left to handle I/O.
To prevent I/O stalls, disallow offlining housekeeping CPUs that are
still serving isolated CPUs.

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ba3a4b77f5786e5372adce53e4fff5aa2ace24aa..d48be77919e671a81077f7042103699a80959664 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3683,6 +3683,43 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	return data.has_rq;
 }
 
+static bool blk_mq_hctx_can_offline_hk_cpu(struct blk_mq_hw_ctx *hctx,
+					   unsigned int this_cpu)
+{
+	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
+	for (int i = 0; i < hctx->nr_ctx; i++) {
+		struct blk_mq_ctx *ctx = hctx->ctxs[i];
+
+		if (ctx->cpu == this_cpu)
+			continue;
+
+		/*
+		 * Check if this context has at least one online
+		 * housekeeping CPU; in this case the hardware context is
+		 * usable.
+		 */
+		if (cpumask_test_cpu(ctx->cpu, hk_mask) &&
+		    cpu_online(ctx->cpu))
+			break;
+
+		/*
+		 * The context doesn't have any online housekeeping CPUs,
+		 * but there might be an online isolated CPU mapped to
+		 * it.
+		 */
+		if (cpu_is_offline(ctx->cpu))
+			continue;
+
+		pr_warn("%s: trying to offline hctx%d but there is still an online isolcpu CPU %d mapped to it\n",
+			hctx->queue->disk->disk_name,
+			hctx->queue_num, ctx->cpu);
+		return false;
+	}
+
+	return true;
+}
+
 static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int this_cpu)
 {
@@ -3714,6 +3751,11 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
 
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		if (!blk_mq_hctx_can_offline_hk_cpu(hctx, cpu))
+			return -EINVAL;
+	}
+
 	if (blk_mq_hctx_has_online_cpu(hctx, cpu))
 		return 0;
 

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 12/12] docs: add io_queue flag to isolcpus
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (10 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 11/12] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
@ 2025-09-05 14:59 ` Daniel Wagner
  2026-03-25 17:56 ` [PATCH v8 00/12] blk: honor isolcpus configuration Sebastian Andrzej Siewior
  12 siblings, 0 replies; 29+ messages in thread
From: Daniel Wagner @ 2025-09-05 14:59 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin
  Cc: Aaron Tomlin, Martin K. Petersen, Thomas Gleixner,
	Costa Shulyupin, Juri Lelli, Valentin Schneider, Waiman Long,
	Ming Lei, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Mathieu Desnoyers, Aaron Tomlin, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream, Daniel Wagner

The io_queue flag informs multiqueue device drivers where to place
hardware queues. Document this new flag in the isolcpus
command-line argument description.

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 747a55abf4946bb9efe320f0f62fdcd1560b0a71..4161d4277a7086f2a3726617826c50888eefb260 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2653,7 +2653,6 @@
 			  "number of CPUs in system - 1".
 
 			managed_irq
-
 			  Isolate from being targeted by managed interrupts
 			  which have an interrupt mask containing isolated
 			  CPUs. The affinity of managed interrupts is
@@ -2676,6 +2675,27 @@
 			  housekeeping CPUs has no influence on those
 			  queues.
 
+			io_queue
+			  Isolate from I/O queue work caused by multiqueue
+			  device drivers. Restrict the placement of
+			  queues to housekeeping CPUs only, ensuring that
+			  all I/O work is processed by a housekeeping CPU.
+
+			  The io_queue configuration takes precedence
+			  over managed_irq. When io_queue is used,
+			  managed_irq placement constrains have no
+			  effect.
+
+			  Note: Offlining housekeeping CPUS which serve
+			  isolated CPUs will be rejected. Isolated CPUs
+			  need to be offlined before offlining the
+			  housekeeping CPUs.
+
+			  Note: When an isolated CPU issues an I/O request,
+			  it is forwarded to a housekeeping CPU. This will
+			  trigger a software interrupt on the completion
+			  path.
+
 			The format of <cpu-list> is described above.
 
 	iucv=		[HW,NET]

-- 
2.51.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 00/12] blk: honor isolcpus configuration
  2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
                   ` (11 preceding siblings ...)
  2025-09-05 14:59 ` [PATCH v8 12/12] docs: add io_queue flag to isolcpus Daniel Wagner
@ 2026-03-25 17:56 ` Sebastian Andrzej Siewior
  2026-03-26  7:42   ` Sebastian Andrzej Siewior
  12 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-25 17:56 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On 2025-09-05 16:59:46 [+0200], Daniel Wagner wrote:
> The main changes in this version are
> 
>   - merged the mapping algorithm into the existing code
>   - dropping a bunch of SCSI drivers update
> 
> With the merging of the isolcpus-aware mapping code, there is a change in
> how the resulting CPU–hctx mapping looks for systems with identical CPUs
> (non-hyperthreaded CPUs). My understanding is that it shouldn't matter,
> but the devil is in the details.
…

I have been just made aware of this. It is still not merged so let me
ask the questions before it is too late.
What is purpose of managed_irq? Doesn't this sort of aligns with this?
It was originally introduce to limit the number of nvme interrupts which
can have multiple interrupts but don't need to use all of them. virtio
is using it.  Yet I don't see an affect if I use isolcpus=managed_irq,X.

I would expect that it limits the number of used interrupts as intended
by this series. The only affect I see is that once the CPU goes off/on
the assigned CPU mask is restored.

What do I miss?

Sebastian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 00/12] blk: honor isolcpus configuration
  2026-03-25 17:56 ` [PATCH v8 00/12] blk: honor isolcpus configuration Sebastian Andrzej Siewior
@ 2026-03-26  7:42   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-26  7:42 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Michael S. Tsirkin, Aaron Tomlin, Martin K. Petersen,
	Thomas Gleixner, Costa Shulyupin, Juri Lelli, Valentin Schneider,
	Waiman Long, Ming Lei, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Mathieu Desnoyers, linux-kernel, linux-block,
	linux-nvme, megaraidlinux.pdl, linux-scsi, storagedev,
	virtualization, GR-QLogic-Storage-Upstream

On 2026-03-25 18:56:13 [+0100], To Daniel Wagner wrote:
> I have been just made aware of this. It is still not merged so let me
> ask the questions before it is too late.
> What is purpose of managed_irq? Doesn't this sort of aligns with this?

No, it does not. I got the answers. Let me write that down. Move on then…

Sebastian

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2026-03-26  7:42 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-05 14:59 [PATCH v8 00/12] blk: honor isolcpus configuration Daniel Wagner
2025-09-05 14:59 ` [PATCH v8 01/12] scsi: aacraid: use block layer helpers to calculate num of queues Daniel Wagner
2025-09-08  6:06   ` Hannes Reinecke
2025-09-05 14:59 ` [PATCH v8 02/12] lib/group_cpus: remove dead !SMP code Daniel Wagner
2025-09-08  6:06   ` Hannes Reinecke
2025-09-05 14:59 ` [PATCH v8 03/12] lib/group_cpus: Add group_mask_cpus_evenly() Daniel Wagner
2025-09-05 14:59 ` [PATCH v8 04/12] genirq/affinity: Add cpumask to struct irq_affinity Daniel Wagner
2025-09-10  8:22   ` Thomas Gleixner
2025-09-05 14:59 ` [PATCH v8 05/12] blk-mq: add blk_mq_{online|possible}_queue_affinity Daniel Wagner
2025-09-05 14:59 ` [PATCH v8 06/12] nvme-pci: use block layer helpers to constrain queue affinity Daniel Wagner
2025-09-05 14:59 ` [PATCH v8 07/12] scsi: Use " Daniel Wagner
2025-09-08  6:08   ` Hannes Reinecke
2025-09-05 14:59 ` [PATCH v8 08/12] virtio: blk/scsi: use " Daniel Wagner
2025-09-05 14:59 ` [PATCH v8 09/12] isolation: Introduce io_queue isolcpus type Daniel Wagner
2025-09-05 14:59 ` [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
2025-09-08  6:13   ` Hannes Reinecke
2025-09-08  7:26     ` Daniel Wagner
2025-09-08  7:51       ` Hannes Reinecke
2025-09-08  8:08         ` Daniel Wagner
2025-09-10  8:20       ` Thomas Gleixner
2025-09-12  8:32         ` Daniel Wagner
2025-09-12 14:31           ` Thomas Gleixner
2025-09-08  7:36   ` Daniel Wagner
2025-09-08 13:05     ` Daniel Wagner
2025-09-10  6:05   ` kernel test robot
2025-09-05 14:59 ` [PATCH v8 11/12] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Daniel Wagner
2025-09-05 14:59 ` [PATCH v8 12/12] docs: add io_queue flag to isolcpus Daniel Wagner
2026-03-25 17:56 ` [PATCH v8 00/12] blk: honor isolcpus configuration Sebastian Andrzej Siewior
2026-03-26  7:42   ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox