[PATCH v9 00/13] blk: honor isolcpus configuration

public inbox for linux-nvme@lists.infradead.org
 help / color / mirror / Atom feed

* [PATCH v9 00/13] blk: honor isolcpus configuration
@ 2026-03-30 22:10 Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 01/13] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
                   ` (13 more replies)
  0 siblings, 14 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

Hi Jens, Keith, Christoph, Sagi, Michael,

I have decided to drive this series forward on behalf of Daniel Wagner, the
original author. This iteration addresses the outstanding architectural and
concurrency concerns raised during the previous review cycle, and the series
has been rebased on v7.0-rc5-509-g545475aebc2a.

Building upon prior iterations, this series introduces critical
architectural refinements to the mapping and affinity spreading algorithms
to guarantee thread safety and resilience against concurrent CPU-hotplug
operations. Previously, the block layer relied on a shared global static
mask (i.e., blk_hk_online_mask), which proved vulnerable to race conditions
during rapid hotplug events. This vulnerability was recently highlighted by
the kernel test robot, which encountered a NULL pointer dereference during
rcutorture (cpuhotplug) stress testing due to concurrent mask modification.

To resolve this, the architecture has been fundamentally hardened. The
global static state has been eradicated. Instead, the IRQ affinity core now
employs a newly introduced irq_spread_hk_filter(), which safely intersects
the natively calculated affinity mask with the HK_TYPE_IO_QUEUE mask.
Crucially, this is achieved using a local, hotplug-safe snapshot via
data_race(cpu_online_mask). This approach circumvents the hotplug lock
deadlocks previously identified by Thomas Gleixner, whilst explicitly
avoiding CONFIG_CPUMASK_OFFSTACK stack bloat hazards on high-core-count
systems. A robust fallback mechanism guarantees that should an interrupt
vector be assigned exclusively to isolated cores, it is safely re-routed to
the system's online housekeeping CPUs.

Please let me know your thoughts.


Changes in v9:
 - Added "Reviewed-by:" tags

 - Introduced irq_spread_hk_filter() to safely restrict managed IRQ
   affinity to housekeeping CPUs (Thomas Gleixner)

 - Removed the unsafe global static variable blk_hk_online_mask from
   blk-mq-cpumap.c and blk-mq.c. blk_mq_online_queue_affinity() now returns
   a stable pointer, delegating safe intersection to the callers to prevent
   concurrent modification races (Thomas Gleixner, Hannes Reinecke)

 - Resolved BUG: kernel NULL pointer dereference in __blk_mq_all_tag_iter
   reported by the kernel test robot during cpuhotplug rcutorture stress
   testing

 - Linked to v8: https://lore.kernel.org/lkml/20250905-isolcpus-io-queues-v8-0-885984c5daca@kernel.org/

Changes in v8:

 - Added commit 524f5eea4bbe ("lib/group_cpus: remove !SMP code")

 - Merged the new mapping logic directly into the existing function to
   avoid special casing

 - Refined the group_mask_cpus_evenly() implementation with the following
   updates:

   - Corrected the function name typo (changed group_masks_cpus_evenly to
     group_mask_cpus_evenly)

   - Updated the documentation comment to accurately reflect the function's
     behavior

   - Renamed the cpu_mask argument to mask for consistency

 - Added a new patch for aacraid to include the missing number of queues
   calculation

 - Restricted updates to only affect SCSI drivers that support
   PCI_IRQ_AFFINITY and do not utilize nvme-fabrics

 - Removed the __free cleanup attribute usage for cpumask_var_t allocations
   due to compatibility issues

 - Updated the documentation to explicitly highlight the limitations
   surrounding CPU offlining

 - Collected accumulated Reviewed-by and Acked-by tags

 - Linked to v7: https://patch.msgid.link/20250702-isolcpus-io-queues-v7-0-557aa7eacce4@kernel.org

Changes in v7:

 - Sent out the first part of the series independently:
   https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@kernel.org/

 - Added comprehensive kernel command-line documentation

 - Added validation logic to ensure the resulting CPU-to-queue mapping is
   fully operational

 - Rewrote the isolcpus mapping code to properly account for active
   hardware contexts (hctx)

 - Introduced blk_mq_map_hk_irq_queues, which utilizes the mask retrieved
   from irq_get_affinity()

 - Refactored blk_mq_map_hk_queues to require the caller to explicitly test
   for HK_TYPE_MANAGED_IRQ

 - Linked to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org

Changes in v6:

 - Reintroduced the io_queue type for the isolcpus kernel parameter

 - Prevented the offlining of a housekeeping CPU if an isolated CPU is
   still present, upgrading this behavior from a simple warning to a hard
   restriction

 - Linked to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org

Changes in v5:

 - Rebased the series onto the latest for-6.14/block branch.

 - Updated the documentation regarding the managed_irq parameters

 - Reworded the commit message for "blk-mq: issue warning when offlining
   hctx with online isolcpus" for better clarity

 - Split the input and output parameters in the patch "lib/group_cpus: let
   group_cpu_evenly return number of groups"

 - Dropped the patch "sched/isolation: document HK_TYPE housekeeping
   option"

 - Linked to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org

Changes in v4:

 - Added the patch "blk-mq: issue warning when offlining hctx with online
   isolcpus"

 - Fixed the check in group_cpus_evenly(); the condition now properly uses
   housekeeping_enabled() instead of cpumask_weight(), as the latter always
   returns a valid mask

 - Dropped the Fixes: tag from "lib/group_cpus.c: honor housekeeping config
   when grouping CPUs"

 - Fixed an overlong line warning in the patch "scsi: use block layer
   helpers to calculate num of queues"

 - Dropped the patch "sched/isolation: Add io_queue housekeeping option" in
   favor of simply documenting the housekeeping hk_type enum

 - Added the patch "lib/group_cpus: let group_cpu_evenly return number of
   groups"

 - Collected accumulated Reviewed-by and Acked-by tags

 - Split the patchset by moving foundational changes into a separate
   preparation series:
   https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/

 - Linked to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de

Changes in v3:

 - Integrated patches from Ming Lei
   (https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/):
   "virtio: add APIs for retrieving vq affinity" and "blk-mq: introduce
   blk_mq_dev_map_queues"

 - Replaced all instances of blk_mq_pci_map_queues and
   blk_mq_virtio_map_queues with the new unified blk_mq_dev_map_queues

 - Updated and expanded the helper functions used for calculating the
   number of queues

 - Added the CPU-to-hctx mapping function specifically to support the
   isolcpus=io_queue parameter

 - Documented the hk_type enum and the newly introduced isolcpus=io_queue
   parameter

 - Added the patch "scsi: pm8001: do not overwrite PCI queue mapping"

 - Linked to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes in v2:

 - Updated the feature documentation for clarity and completeness

 - Split the blk/nvme-pci patch into smaller, logical commits

 - Dropped the HK_TYPE_IO_QUEUE macro in favor of reusing
   HK_TYPE_MANAGED_IRQ

 - Linked to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de


Aaron Tomlin (1):
  genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs

Daniel Wagner (12):
  scsi: aacraid: use block layer helpers to calculate num of queues
  lib/group_cpus: remove dead !SMP code
  lib/group_cpus: Add group_mask_cpus_evenly()
  genirq/affinity: Add cpumask to struct irq_affinity
  blk-mq: add blk_mq_{online|possible}_queue_affinity
  nvme-pci: use block layer helpers to constrain queue affinity
  scsi: Use block layer helpers to constrain queue affinity
  virtio: blk/scsi: use block layer helpers to constrain queue affinity
  isolation: Introduce io_queue isolcpus type
  blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  docs: add io_queue flag to isolcpus

 .../admin-guide/kernel-parameters.txt         |  22 +-
 block/blk-mq-cpumap.c                         | 201 ++++++++++++++++--
 block/blk-mq.c                                |  42 ++++
 drivers/block/virtio_blk.c                    |   4 +-
 drivers/nvme/host/pci.c                       |   1 +
 drivers/scsi/aacraid/comminit.c               |   3 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c        |   1 +
 drivers/scsi/megaraid/megaraid_sas_base.c     |   5 +-
 drivers/scsi/mpi3mr/mpi3mr_fw.c               |   6 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c           |   5 +-
 drivers/scsi/pm8001/pm8001_init.c             |   1 +
 drivers/scsi/virtio_scsi.c                    |   5 +-
 include/linux/blk-mq.h                        |   2 +
 include/linux/group_cpus.h                    |   3 +
 include/linux/interrupt.h                     |  16 +-
 include/linux/sched/isolation.h               |   1 +
 kernel/irq/affinity.c                         |  38 +++-
 kernel/sched/isolation.c                      |   7 +
 lib/group_cpus.c                              |  65 ++++--
 19 files changed, 379 insertions(+), 49 deletions(-)


base-commit: 545475aebc2a2e8df14fadc911a7a2d03ddd6a1f
-- 
2.51.0



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v9 01/13] scsi: aacraid: use block layer helpers to calculate num of queues
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code Aaron Tomlin
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

The calculation of the upper limit for queues does not depend solely on
the number of online CPUs; for example, the isolcpus kernel
command-line option must also be considered.

To account for this, the block layer provides a helper function to
retrieve the maximum number of queues. Use it to set an appropriate
upper queue number limit.

Fixes: 94970cfb5f10 ("scsi: use block layer helpers to calculate num of queues")
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/aacraid/comminit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
index 9bd3f5b868bc..ec165b57182d 100644
--- a/drivers/scsi/aacraid/comminit.c
+++ b/drivers/scsi/aacraid/comminit.c
@@ -469,8 +469,7 @@ void aac_define_int_mode(struct aac_dev *dev)
 	}
 
 	/* Don't bother allocating more MSI-X vectors than cpus */
-	msi_count = min(dev->max_msix,
-		(unsigned int)num_online_cpus());
+	msi_count = blk_mq_num_online_queues(dev->max_msix);
 
 	dev->max_msix = msi_count;
 
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 01/13] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-04-01 12:29   ` Sebastian Andrzej Siewior
  2026-03-30 22:10 ` [PATCH v9 03/13] lib/group_cpus: Add group_mask_cpus_evenly() Aaron Tomlin
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

The support for the !SMP configuration has been removed from the core by
commit cac5cefbade9 ("sched/smp: Make SMP unconditional").

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 lib/group_cpus.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index e6e18d7a49bb..b8d54398f88a 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -9,8 +9,6 @@
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
 
-#ifdef CONFIG_SMP
-
 static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
 {
@@ -564,22 +562,4 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	*nummasks = min(nr_present + nr_others, numgrps);
 	return masks;
 }
-#else /* CONFIG_SMP */
-struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
-{
-	struct cpumask *masks;
-
-	if (numgrps == 0)
-		return NULL;
-
-	masks = kzalloc_objs(*masks, numgrps);
-	if (!masks)
-		return NULL;
-
-	/* assign all CPUs(cpu 0) to the 1st group only */
-	cpumask_copy(&masks[0], cpu_possible_mask);
-	*nummasks = 1;
-	return masks;
-}
-#endif /* CONFIG_SMP */
 EXPORT_SYMBOL_GPL(group_cpus_evenly);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code
  2026-03-30 22:10 ` [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code Aaron Tomlin
@ 2026-04-01 12:29   ` Sebastian Andrzej Siewior
  2026-04-01 19:31     ` Aaron Tomlin
  0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-04-01 12:29 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, yphbchou0911, wagi, frederic, longman,
	chenridong, hare, kch, ming.lei, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 2026-03-30 18:10:36 [-0400], Aaron Tomlin wrote:
> From: Daniel Wagner <wagi@kernel.org>
> 
> The support for the !SMP configuration has been removed from the core by
> commit cac5cefbade9 ("sched/smp: Make SMP unconditional").

!SMP is not dead code here. You can very much compile a !SMP kernel at
which point the code below will be used. It is more that the sched
department decided that scheduler's maintenance will be easier since we
don't have to deal with !SMP case anymore.

If you wish to remove the !SMP case here you need to argue as such.

> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>

The previous patch, this one and probably the following lack a
Signed-off-by line with your name.

Sebastian


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code
  2026-04-01 12:29   ` Sebastian Andrzej Siewior
@ 2026-04-01 19:31     ` Aaron Tomlin
  0 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-04-01 19:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, yphbchou0911, wagi, frederic, longman,
	chenridong, hare, kch, ming.lei, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

[-- Attachment #1: Type: text/plain, Size: 1600 bytes --]

On Wed, Apr 01, 2026 at 02:29:08PM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-03-30 18:10:36 [-0400], Aaron Tomlin wrote:
> > From: Daniel Wagner <wagi@kernel.org>
> > 
> > The support for the !SMP configuration has been removed from the core by
> > commit cac5cefbade9 ("sched/smp: Make SMP unconditional").
> 
> !SMP is not dead code here. You can very much compile a !SMP kernel at
> which point the code below will be used. It is more that the sched
> department decided that scheduler's maintenance will be easier since we
> don't have to deal with !SMP case anymore.
> 
> If you wish to remove the !SMP case here you need to argue as such.
> 
> > Signed-off-by: Daniel Wagner <wagi@kernel.org>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> 
> The previous patch, this one and probably the following lack a
> Signed-off-by line with your name.
> 
Hi Sebastian,

Thank you for your review and for clarifying the precise nature of the !SMP
configuration.

Fair point. Describing it broadly as "dead code" is indeed inaccurate,
given that one can certainly still compile a uniprocessor kernel. I shall
amend the commit message in the next series to properly articulate the
reasoning: we are removing this specific !SMP fallback logic because the
core scheduler now mandates SMP unconditionally (as per commit
cac5cefbade9), rendering this particular handling redundant for our
purposes, rather than it being globally obsolete.

Furthermore, my apologies for the oversight regarding the missing
Signed-off-by tags.


Kind regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v9 03/13] lib/group_cpus: Add group_mask_cpus_evenly()
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 01/13] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 04/13] genirq/affinity: Add cpumask to struct irq_affinity Aaron Tomlin
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

group_mask_cpu_evenly() allows the caller to pass in a CPU mask that
should be evenly distributed. This new function is a more generic
version of the existing group_cpus_evenly(), which always distributes
all present CPUs into groups.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/group_cpus.h |  3 ++
 lib/group_cpus.c           | 59 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index 9d4e5ab6c314..defab4123a82 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -10,5 +10,8 @@
 #include <linux/cpu.h>
 
 struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks);
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *mask,
+				       unsigned int *nummasks);
 
 #endif
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index b8d54398f88a..d3e9a20250ff 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -8,6 +8,7 @@
 #include <linux/cpu.h>
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
@@ -563,3 +564,61 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	return masks;
 }
 EXPORT_SYMBOL_GPL(group_cpus_evenly);
+
+/**
+ * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of cpumasks to create
+ * @mask: CPUs to consider for the grouping
+ * @nummasks: number of initialized cpusmasks
+ *
+ * Return: cpumask array if successful, NULL otherwise. Only the CPUs
+ * marked in the mask will be considered for the grouping. And each
+ * element includes CPUs assigned to this group. nummasks contains the
+ * number of initialized masks which can be less than numgrps. cpu_mask
+ *
+ * Try to put close CPUs from viewpoint of CPU and NUMA locality into
+ * same group, and run two-stage grouping:
+ *	1) allocate present CPUs on these groups evenly first
+ *	2) allocate other possible CPUs on these groups evenly
+ *
+ * We guarantee in the resulted grouping that all CPUs are covered, and
+ * no same CPU is assigned to multiple groups
+ */
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *mask,
+				       unsigned int *nummasks)
+{
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_nmsk;
+
+	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	build_node_to_cpumask(node_to_cpumask);
+
+	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, mask, nmsk,
+				  masks);
+
+fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	*nummasks = ret;
+	return masks;
+}
+EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 04/13] genirq/affinity: Add cpumask to struct irq_affinity
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (2 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 03/13] lib/group_cpus: Add group_mask_cpus_evenly() Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 05/13] blk-mq: add blk_mq_{online|possible}_queue_affinity Aaron Tomlin
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

Pass a cpumask to irq_create_affinity_masks as an additional constraint
to consider when creating the affinity masks. This allows the caller to
exclude specific CPUs, e.g., isolated CPUs (see the 'isolcpus' kernel
command-line parameter).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/interrupt.h | 16 ++++++++++------
 kernel/irq/affinity.c     | 12 ++++++++++--
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 6cd26ffb0505..afd5a2c75b43 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -287,18 +287,22 @@ struct irq_affinity_notify {
  * @nr_sets:		The number of interrupt sets for which affinity
  *			spreading is required
  * @set_size:		Array holding the size of each interrupt set
+ * @mask:		cpumask that constrains which CPUs to consider when
+ *			calculating the number and size of the interrupt sets
  * @calc_sets:		Callback for calculating the number and size
  *			of interrupt sets
  * @priv:		Private data for usage by @calc_sets, usually a
  *			pointer to driver/device specific data.
  */
 struct irq_affinity {
-	unsigned int	pre_vectors;
-	unsigned int	post_vectors;
-	unsigned int	nr_sets;
-	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
-	void		(*calc_sets)(struct irq_affinity *, unsigned int nvecs);
-	void		*priv;
+	unsigned int		pre_vectors;
+	unsigned int		post_vectors;
+	unsigned int		nr_sets;
+	unsigned int		set_size[IRQ_AFFINITY_MAX_SETS];
+	const struct cpumask	*mask;
+	void			(*calc_sets)(struct irq_affinity *,
+					     unsigned int nvecs);
+	void			*priv;
 };
 
 /**
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 85c45cfe7223..076a5ef1e306 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -70,7 +70,13 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	 */
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int nr_masks, this_vecs = affd->set_size[i];
-		struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks);
+		struct cpumask *result;
+
+		if (affd->mask)
+			result = group_mask_cpus_evenly(this_vecs, affd->mask,
+							&nr_masks);
+		else
+			result = group_cpus_evenly(this_vecs, &nr_masks);
 
 		if (!result) {
 			kfree(masks);
@@ -115,7 +121,9 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
 	if (resv > minvec)
 		return 0;
 
-	if (affd->calc_sets) {
+	if (affd->mask) {
+		set_vecs = cpumask_weight(affd->mask);
+	} else if (affd->calc_sets) {
 		set_vecs = maxvec - resv;
 	} else {
 		cpus_read_lock();
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 05/13] blk-mq: add blk_mq_{online|possible}_queue_affinity
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (3 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 04/13] genirq/affinity: Add cpumask to struct irq_affinity Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 06/13] nvme-pci: use block layer helpers to constrain queue affinity Aaron Tomlin
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

Introduce blk_mq_{online|possible}_queue_affinity, which returns the
queue-to-CPU mapping constraints defined by the block layer. This allows
other subsystems (e.g., IRQ affinity setup) to respect block layer
requirements.

It is necessary to provide versions for both the online and possible CPU
masks because some drivers want to spread their I/O queues only across
online CPUs, while others prefer to use all possible CPUs. And the mask
used needs to match with the number of queues requested
(see blk_num_{online|possible}_queues).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c  | 24 ++++++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 705da074ad6c..8244ecf87835 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -26,6 +26,30 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
 	return min_not_zero(num, max_queues);
 }
 
+/**
+ * blk_mq_possible_queue_affinity - Return block layer queue affinity
+ *
+ * Returns an affinity mask that represents the queue-to-CPU mapping
+ * requested by the block layer based on possible CPUs.
+ */
+const struct cpumask *blk_mq_possible_queue_affinity(void)
+{
+	return cpu_possible_mask;
+}
+EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
+
+/**
+ * blk_mq_online_queue_affinity - Return block layer queue affinity
+ *
+ * Returns an affinity mask that represents the queue-to-CPU mapping
+ * requested by the block layer based on online CPUs.
+ */
+const struct cpumask *blk_mq_online_queue_affinity(void)
+{
+	return cpu_online_mask;
+}
+EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
+
 /**
  * blk_mq_num_possible_queues - Calc nr of queues for multiqueue devices
  * @max_queues:	The maximum number of queues the hardware/driver
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..ebc45557aee8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -969,6 +969,8 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 void blk_mq_unfreeze_queue_non_owner(struct request_queue *q);
 void blk_freeze_queue_start_non_owner(struct request_queue *q);
 
+const struct cpumask *blk_mq_possible_queue_affinity(void);
+const struct cpumask *blk_mq_online_queue_affinity(void);
 unsigned int blk_mq_num_possible_queues(unsigned int max_queues);
 unsigned int blk_mq_num_online_queues(unsigned int max_queues);
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 06/13] nvme-pci: use block layer helpers to constrain queue affinity
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (4 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 05/13] blk-mq: add blk_mq_{online|possible}_queue_affinity Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 07/13] scsi: Use " Aaron Tomlin
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the NVMe driver
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b78ba239c8ea..8e05ad06283e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2862,6 +2862,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 		.pre_vectors	= 1,
 		.calc_sets	= nvme_calc_irq_sets,
 		.priv		= dev,
+		.mask		= blk_mq_possible_queue_affinity(),
 	};
 	unsigned int irq_queues, poll_queues;
 	unsigned int flags = PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY;
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 07/13] scsi: Use block layer helpers to constrain queue affinity
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (5 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 06/13] nvme-pci: use block layer helpers to constrain queue affinity Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 08/13] virtio: blk/scsi: use " Aaron Tomlin
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the SCSI drivers
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Only convert drivers which are already using the
pci_alloc_irq_vectors_affinity with the PCI_IRQ_AFFINITY flag set.
Because these drivers are enabled to let the IRQ core code to
set the affinity. Also don't update qla2xxx because the nvme-fabrics
code is not ready yet.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 1 +
 drivers/scsi/megaraid/megaraid_sas_base.c | 5 ++++-
 drivers/scsi/mpi3mr/mpi3mr_fw.c           | 6 +++++-
 drivers/scsi/mpt3sas/mpt3sas_base.c       | 5 ++++-
 drivers/scsi/pm8001/pm8001_init.c         | 1 +
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index f69efc6494b8..d1f689224e7b 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2605,6 +2605,7 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 	struct pci_dev *pdev = hisi_hba->pci_dev;
 	struct irq_affinity desc = {
 		.pre_vectors = BASE_VECTORS_V3_HW,
+		.mask = blk_mq_online_queue_affinity(),
 	};
 
 	min_msi = MIN_AFFINE_VECTORS_V3_HW;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index ac71ea4898b2..7e2a3c187ee0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5925,7 +5925,10 @@ static int
 __megasas_alloc_irq_vectors(struct megasas_instance *instance)
 {
 	int i, irq_flags;
-	struct irq_affinity desc = { .pre_vectors = instance->low_latency_index_start };
+	struct irq_affinity desc = {
+		.pre_vectors = instance->low_latency_index_start,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 	struct irq_affinity *descp = &desc;
 
 	irq_flags = PCI_IRQ_MSIX;
diff --git a/drivers/scsi/mpi3mr/mpi3mr_fw.c b/drivers/scsi/mpi3mr/mpi3mr_fw.c
index c744210cc901..f9b8b3639c64 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_fw.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_fw.c
@@ -830,7 +830,11 @@ static int mpi3mr_setup_isr(struct mpi3mr_ioc *mrioc, u8 setup_one)
 	int max_vectors, min_vec;
 	int retval;
 	int i;
-	struct irq_affinity desc = { .pre_vectors =  1, .post_vectors = 1 };
+	struct irq_affinity desc = {
+		.pre_vectors =  1,
+		.post_vectors = 1,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 
 	if (mrioc->is_intr_info_set)
 		return 0;
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 79052f2accbd..91e1622b5b77 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3370,7 +3370,10 @@ static int
 _base_alloc_irq_vectors(struct MPT3SAS_ADAPTER *ioc)
 {
 	int i, irq_flags = PCI_IRQ_MSIX;
-	struct irq_affinity desc = { .pre_vectors = ioc->high_iops_queues };
+	struct irq_affinity desc = {
+		.pre_vectors = ioc->high_iops_queues,
+		.mask = blk_mq_online_queue_affinity(),
+	};
 	struct irq_affinity *descp = &desc;
 	/*
 	 * Don't allocate msix vectors for poll_queues.
diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index e93ea76b565e..6360fa95bcf4 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -978,6 +978,7 @@ static u32 pm8001_setup_msix(struct pm8001_hba_info *pm8001_ha)
 		 */
 		struct irq_affinity desc = {
 			.pre_vectors = 1,
+			.mask = blk_mq_online_queue_affinity(),
 		};
 		rc = pci_alloc_irq_vectors_affinity(
 				pm8001_ha->pdev, 2, PM8001_MAX_MSIX_VEC,
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 08/13] virtio: blk/scsi: use block layer helpers to constrain queue affinity
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (6 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 07/13] scsi: Use " Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type Aaron Tomlin
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

Ensure that IRQ affinity setup also respects the queue-to-CPU mapping
constraints provided by the block layer. This allows the virtio drivers
to avoid assigning interrupts to CPUs that the block layer has excluded
(e.g., isolated CPUs).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/block/virtio_blk.c | 4 +++-
 drivers/scsi/virtio_scsi.c | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b1c9a27fe00f..9d737510454b 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -964,7 +964,9 @@ static int init_vq(struct virtio_blk *vblk)
 	unsigned short num_vqs;
 	unsigned short num_poll_vqs;
 	struct virtio_device *vdev = vblk->vdev;
-	struct irq_affinity desc = { 0, };
+	struct irq_affinity desc = {
+		.mask = blk_mq_possible_queue_affinity(),
+	};
 
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
 				   struct virtio_blk_config, num_queues,
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 0ed8558dad72..520a7da5386e 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -849,7 +849,10 @@ static int virtscsi_init(struct virtio_device *vdev,
 	u32 num_vqs, num_poll_vqs, num_req_vqs;
 	struct virtqueue_info *vqs_info;
 	struct virtqueue **vqs;
-	struct irq_affinity desc = { .pre_vectors = 2 };
+	struct irq_affinity desc = {
+		.pre_vectors = 2,
+		.mask = blk_mq_possible_queue_affinity(),
+	};
 
 	num_req_vqs = vscsi->num_queues;
 	num_vqs = num_req_vqs + VIRTIO_SCSI_VQ_BASE;
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (7 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 08/13] virtio: blk/scsi: use " Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-04-01 12:49   ` Sebastian Andrzej Siewior
  2026-03-30 22:10 ` [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

Multiqueue drivers spread I/O queues across all CPUs for optimal
performance. However, these drivers are not aware of CPU isolation
requirements and will distribute queues without considering the isolcpus
configuration.

Introduce a new isolcpus mask that allows users to define which CPUs
should have I/O queues assigned. This is similar to managed_irq, but
intended for drivers that do not use the managed IRQ infrastructure

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 include/linux/sched/isolation.h | 1 +
 kernel/sched/isolation.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index dc3975ff1b2e..7b266fc2a405 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -18,6 +18,7 @@ enum hk_type {
 	HK_TYPE_MANAGED_IRQ,
 	/* Inverse of boot-time nohz_full= or isolcpus=nohz arguments */
 	HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_IO_QUEUE,
 	HK_TYPE_MAX,
 
 	/*
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ef152d401fe2..3406e3024fd4 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -16,6 +16,7 @@ enum hk_flags {
 	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
 	HK_FLAG_KERNEL_NOISE	= BIT(HK_TYPE_KERNEL_NOISE),
+	HK_FLAG_IO_QUEUE	= BIT(HK_TYPE_IO_QUEUE),
 };
 
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
@@ -340,6 +341,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "io_queue,", 9)) {
+			str += 9;
+			flags |= HK_FLAG_IO_QUEUE;
+			continue;
+		}
+
 		/*
 		 * Skip unknown sub-parameter and validate that it is not
 		 * containing an invalid character.
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-03-30 22:10 ` [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type Aaron Tomlin
@ 2026-04-01 12:49   ` Sebastian Andrzej Siewior
  2026-04-01 19:05     ` Waiman Long
  2026-04-01 20:58     ` Aaron Tomlin
  0 siblings, 2 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-04-01 12:49 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, yphbchou0911, wagi, frederic, longman,
	chenridong, hare, kch, ming.lei, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 2026-03-30 18:10:43 [-0400], Aaron Tomlin wrote:
> From: Daniel Wagner <wagi@kernel.org>
> 
> Multiqueue drivers spread I/O queues across all CPUs for optimal
> performance. However, these drivers are not aware of CPU isolation
> requirements and will distribute queues without considering the isolcpus
> configuration.
> 
> Introduce a new isolcpus mask that allows users to define which CPUs
> should have I/O queues assigned. This is similar to managed_irq, but
> intended for drivers that do not use the managed IRQ infrastructure

I set down and documented the behaviour of managed_irq at
	https://lore.kernel.org/all/20260401110232.ET5RxZfl@linutronix.de/

Could we please clarify whether we want to keep it and this
additionally or if managed_irq could be used instead. This adds another
bit. If networking folks jump in on managed_irqs, would they need to
duplicate this with their net sub flag?

> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>

Sebastian


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-04-01 12:49   ` Sebastian Andrzej Siewior
@ 2026-04-01 19:05     ` Waiman Long
  2026-04-02  7:58       ` Sebastian Andrzej Siewior
  2026-04-01 20:58     ` Aaron Tomlin
  1 sibling, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-04-01 19:05 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Aaron Tomlin
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, yphbchou0911, wagi, frederic, chenridong,
	hare, kch, ming.lei, steve, sean, chjohnst, neelx, mproche,
	linux-block, linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

On 4/1/26 8:49 AM, Sebastian Andrzej Siewior wrote:
> On 2026-03-30 18:10:43 [-0400], Aaron Tomlin wrote:
>> From: Daniel Wagner <wagi@kernel.org>
>>
>> Multiqueue drivers spread I/O queues across all CPUs for optimal
>> performance. However, these drivers are not aware of CPU isolation
>> requirements and will distribute queues without considering the isolcpus
>> configuration.
>>
>> Introduce a new isolcpus mask that allows users to define which CPUs
>> should have I/O queues assigned. This is similar to managed_irq, but
>> intended for drivers that do not use the managed IRQ infrastructure
> I set down and documented the behaviour of managed_irq at
> 	https://lore.kernel.org/all/20260401110232.ET5RxZfl@linutronix.de/
>
> Could we please clarify whether we want to keep it and this
> additionally or if managed_irq could be used instead. This adds another
> bit. If networking folks jump in on managed_irqs, would they need to
> duplicate this with their net sub flag?

Yes, I will very much prefer to reuse an existing HK cpumask like 
managed_irqs for this purpose, if possible, rather than adding another 
cpumask that we need to manage. Note that we are in the process of 
making these housekeeping cpumasks modifiable at run time in the near 
future.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-04-01 19:05     ` Waiman Long
@ 2026-04-02  7:58       ` Sebastian Andrzej Siewior
  2026-04-03  1:54         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-04-02  7:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Aaron Tomlin, axboe, kbusch, hch, sagi, mst, aacraid,
	James.Bottomley, martin.petersen, liyihang9, kashyap.desai,
	sumit.saxena, shivasharan.srikanteshwara, chandrakanth.patil,
	sathya.prakash, sreekanth.reddy, suganath-prabu.subramani,
	ranjan.kumar, jinpu.wang, tglx, mingo, peterz, juri.lelli,
	vincent.guittot, akpm, maz, ruanjinjie, yphbchou0911, wagi,
	frederic, chenridong, hare, kch, ming.lei, steve, sean, chjohnst,
	neelx, mproche, linux-block, linux-kernel, virtualization,
	linux-nvme, linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 2026-04-01 15:05:21 [-0400], Waiman Long wrote:
> > Could we please clarify whether we want to keep it and this
> > additionally or if managed_irq could be used instead. This adds another
> > bit. If networking folks jump in on managed_irqs, would they need to
> > duplicate this with their net sub flag?
> 
> Yes, I will very much prefer to reuse an existing HK cpumask like
> managed_irqs for this purpose, if possible, rather than adding another
> cpumask that we need to manage. Note that we are in the process of making
> these housekeeping cpumasks modifiable at run time in the near future.

Now if you want to change it at run time, it would mean to reconfigure
the interrupts, device and so on. Not sure if this useful _or_ if it
would be "easier" to just the tell upper layer (block in case of I/O)
not to perform any request on this CPU. Then we would have an interrupt
on that CPU but it wouldn't do anything.
It would only become a problem if you would have less queues than CPUs
and you would like to migrate things.

> Cheers,
> Longman

Sebastian


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-04-02  7:58       ` Sebastian Andrzej Siewior
@ 2026-04-03  1:54         ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-04-03  1:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Aaron Tomlin, axboe, kbusch, hch, sagi, mst, aacraid,
	James.Bottomley, martin.petersen, liyihang9, kashyap.desai,
	sumit.saxena, shivasharan.srikanteshwara, chandrakanth.patil,
	sathya.prakash, sreekanth.reddy, suganath-prabu.subramani,
	ranjan.kumar, jinpu.wang, tglx, mingo, peterz, juri.lelli,
	vincent.guittot, akpm, maz, ruanjinjie, yphbchou0911, wagi,
	frederic, chenridong, hare, kch, ming.lei, steve, sean, chjohnst,
	neelx, mproche, linux-block, linux-kernel, virtualization,
	linux-nvme, linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 4/2/26 3:58 AM, Sebastian Andrzej Siewior wrote:
> On 2026-04-01 15:05:21 [-0400], Waiman Long wrote:
>>> Could we please clarify whether we want to keep it and this
>>> additionally or if managed_irq could be used instead. This adds another
>>> bit. If networking folks jump in on managed_irqs, would they need to
>>> duplicate this with their net sub flag?
>> Yes, I will very much prefer to reuse an existing HK cpumask like
>> managed_irqs for this purpose, if possible, rather than adding another
>> cpumask that we need to manage. Note that we are in the process of making
>> these housekeeping cpumasks modifiable at run time in the near future.
> Now if you want to change it at run time, it would mean to reconfigure
> the interrupts, device and so on. Not sure if this useful _or_ if it
> would be "easier" to just the tell upper layer (block in case of I/O)
> not to perform any request on this CPU. Then we would have an interrupt
> on that CPU but it wouldn't do anything.
> It would only become a problem if you would have less queues than CPUs
> and you would like to migrate things.

I know that it will not be easy. Anyway, I will let this patch reaching 
a good compromise and get merged first before thinking about that.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-04-01 12:49   ` Sebastian Andrzej Siewior
  2026-04-01 19:05     ` Waiman Long
@ 2026-04-01 20:58     ` Aaron Tomlin
  2026-04-02  9:09       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 29+ messages in thread
From: Aaron Tomlin @ 2026-04-01 20:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, yphbchou0911, wagi, frederic, longman,
	chenridong, hare, kch, ming.lei, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

[-- Attachment #1: Type: text/plain, Size: 3646 bytes --]

On Wed, Apr 01, 2026 at 02:49:47PM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-03-30 18:10:43 [-0400], Aaron Tomlin wrote:
> > From: Daniel Wagner <wagi@kernel.org>
> > 
> > Multiqueue drivers spread I/O queues across all CPUs for optimal
> > performance. However, these drivers are not aware of CPU isolation
> > requirements and will distribute queues without considering the isolcpus
> > configuration.
> > 
> > Introduce a new isolcpus mask that allows users to define which CPUs
> > should have I/O queues assigned. This is similar to managed_irq, but
> > intended for drivers that do not use the managed IRQ infrastructure
> 
> I set down and documented the behaviour of managed_irq at
> 	https://lore.kernel.org/all/20260401110232.ET5RxZfl@linutronix.de/
> 
> Could we please clarify whether we want to keep it and this
> additionally or if managed_irq could be used instead. This adds another
> bit. If networking folks jump in on managed_irqs, would they need to
> duplicate this with their net sub flag?

Hi Sebastian,

Thank you for taking the time to document the "managed_irq" behaviour; it
is immensely helpful. You raise a highly pertinent point regarding the
potential proliferation of "isolcpus=" flags. It is certainly a situation
that must be managed carefully to prevent every subsystem from demanding
its own bit.

To clarify the reasoning behind introducing "io_queue" rather than strictly
relying on managed_irq:

The managed_irq flag belongs firmly to the interrupt subsystem. It dictates
whether a CPU is eligible to receive hardware interrupts whose affinity is
managed by the kernel. Whilst many modern block drivers use managed IRQs,
the block layer multi-queue mapping encompasses far more than just
interrupt routing. It maps logical queues to CPUs to handle I/O submission,
software queues, and crucially, poll queues, which do not utilise
interrupts at all. Furthermore, there are specific drivers that do not use
the managed IRQ infrastructure but still rely on the block layer for queue
distribution.

If managed_irq were solely relied upon, the IRQ subsystem would
successfully keep hardware interrupts off the isolated CPUs, but the block
layer would still blindly map polling queues or non-managed queues to those
same isolated CPUs. This would force isolated CPUs to process I/O
submissions or handle polling tasks, thereby breaking the strict isolation.

Regarding the point about the networking subsystem, it is a very valid
comparison. If the networking layer wishes to respect isolcpus in the
future, adding a net flag would indeed exacerbate the bit proliferation.

For the present time, retaining io_queue seems the most prudent approach to
ensure that block queue mapping remains semantically distinct from
interrupt delivery. This provides an immediate and clean architectural
boundary. However, if the consensus amongst the maintainers suggests that
this is too granular, alternative approaches could certainly be considered
for the future. For instance, a broader, more generic flag could be
introduced to encompass both block and future networking queue mappings.
Alternatively, if semantic conflation is deemed acceptable, the existing
managed_irq housekeeping mask could simply be overloaded within the block
layer to restrict all queue mappings.

Keeping the current separation appears to be the cleanest solution for this
series, but your thoughts, and those of the wider community, on potentially
migrating to a consolidated generic flag in the future would be very much
welcomed.

Kind regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-04-01 20:58     ` Aaron Tomlin
@ 2026-04-02  9:09       ` Sebastian Andrzej Siewior
  2026-04-03  0:50         ` Aaron Tomlin
  0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-04-02  9:09 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, yphbchou0911, wagi, frederic, longman,
	chenridong, hare, kch, ming.lei, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 2026-04-01 16:58:22 [-0400], Aaron Tomlin wrote:
> Hi Sebastian,
Hi,

> Thank you for taking the time to document the "managed_irq" behaviour; it
> is immensely helpful. You raise a highly pertinent point regarding the
> potential proliferation of "isolcpus=" flags. It is certainly a situation
> that must be managed carefully to prevent every subsystem from demanding
> its own bit.
> 
> To clarify the reasoning behind introducing "io_queue" rather than strictly
> relying on managed_irq:
> 
> The managed_irq flag belongs firmly to the interrupt subsystem. It dictates
> whether a CPU is eligible to receive hardware interrupts whose affinity is
> managed by the kernel. Whilst many modern block drivers use managed IRQs,
> the block layer multi-queue mapping encompasses far more than just
> interrupt routing. It maps logical queues to CPUs to handle I/O submission,
> software queues, and crucially, poll queues, which do not utilise
> interrupts at all. Furthermore, there are specific drivers that do not use
> the managed IRQ infrastructure but still rely on the block layer for queue
> distribution.

Could you tell block which queue maps to which CPU at /sys/block/$$/mq/
level? Then you have one queue going to one CPU.
Then the drive could request one or more interrupts managed or not. For
managed you could specify a CPU mask which you desire to occupy.
You have the case where
- you have more queues than CPUs
  - use all of them
  - use less
- less queues than CPUs
  - mapped a queue to more than once CPU in case it goes down or becomes
    not available
  - mapped to one CPU

Ideally you solve this at one level so that the device(s) can request
less queues than CPUs if told so without patching each and every driver.

This should give you the freedom to isolate CPUs, decide at boot time
which CPUs get I/O queues assigned. At run time you can tell which
queues go to which CPUs. If you shutdown a queue, the interrupt remains
but does not get any I/O requests assigned so no problem. If the CPU
goes down, same thing.

I am trying to come up with a design here which I haven't found so far.
But I might be late to the party and everyone else is fully aware.

> If managed_irq were solely relied upon, the IRQ subsystem would
> successfully keep hardware interrupts off the isolated CPUs, but the block

The managed_irqs can't be influence by userland. The CPUs are auto
distributed.

> layer would still blindly map polling queues or non-managed queues to those
> same isolated CPUs. This would force isolated CPUs to process I/O
> submissions or handle polling tasks, thereby breaking the strict isolation.
> 
> Regarding the point about the networking subsystem, it is a very valid
> comparison. If the networking layer wishes to respect isolcpus in the
> future, adding a net flag would indeed exacerbate the bit proliferation.

Networking could also have different cases like adding a RX filter and
having HW putting packet based on it in a dedicated queue. But also in
this case I would like to have the freedome to decide which isolated
CPUs should receive interrupts/ traffic and which don't.

> For the present time, retaining io_queue seems the most prudent approach to
> ensure that block queue mapping remains semantically distinct from
> interrupt delivery. This provides an immediate and clean architectural
> boundary. However, if the consensus amongst the maintainers suggests that
> this is too granular, alternative approaches could certainly be considered
> for the future. For instance, a broader, more generic flag could be
> introduced to encompass both block and future networking queue mappings.
> Alternatively, if semantic conflation is deemed acceptable, the existing
> managed_irq housekeeping mask could simply be overloaded within the block
> layer to restrict all queue mappings.
> 
> Keeping the current separation appears to be the cleanest solution for this
> series, but your thoughts, and those of the wider community, on potentially
> migrating to a consolidated generic flag in the future would be very much
> welcomed.

I just don't like introducing yet another boot argument, making it a
boot constraint while in my naive view this could be managed at some
degree via sysfs as suggested above.

> 
> Kind regards,

Sebastian


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-04-02  9:09       ` Sebastian Andrzej Siewior
@ 2026-04-03  0:50         ` Aaron Tomlin
  2026-04-03  1:20           ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Aaron Tomlin @ 2026-04-03  0:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, yphbchou0911, wagi, frederic, longman,
	chenridong, hare, kch, ming.lei, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Thu, Apr 02, 2026 at 11:09:40AM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-04-01 16:58:22 [-0400], Aaron Tomlin wrote:
> > Hi Sebastian,
> Hi,
> 
> > Thank you for taking the time to document the "managed_irq" behaviour; it
> > is immensely helpful. You raise a highly pertinent point regarding the
> > potential proliferation of "isolcpus=" flags. It is certainly a situation
> > that must be managed carefully to prevent every subsystem from demanding
> > its own bit.
> > 
> > To clarify the reasoning behind introducing "io_queue" rather than strictly
> > relying on managed_irq:
> > 
> > The managed_irq flag belongs firmly to the interrupt subsystem. It dictates
> > whether a CPU is eligible to receive hardware interrupts whose affinity is
> > managed by the kernel. Whilst many modern block drivers use managed IRQs,
> > the block layer multi-queue mapping encompasses far more than just
> > interrupt routing. It maps logical queues to CPUs to handle I/O submission,
> > software queues, and crucially, poll queues, which do not utilise
> > interrupts at all. Furthermore, there are specific drivers that do not use
> > the managed IRQ infrastructure but still rely on the block layer for queue
> > distribution.
> 
> Could you tell block which queue maps to which CPU at /sys/block/$$/mq/
> level? Then you have one queue going to one CPU.
> Then the drive could request one or more interrupts managed or not. For
> managed you could specify a CPU mask which you desire to occupy.
> You have the case where
> - you have more queues than CPUs
>   - use all of them
>   - use less
> - less queues than CPUs
>   - mapped a queue to more than once CPU in case it goes down or becomes
>     not available
>   - mapped to one CPU
> 
> Ideally you solve this at one level so that the device(s) can request
> less queues than CPUs if told so without patching each and every driver.
> 
> This should give you the freedom to isolate CPUs, decide at boot time
> which CPUs get I/O queues assigned. At run time you can tell which
> queues go to which CPUs. If you shutdown a queue, the interrupt remains
> but does not get any I/O requests assigned so no problem. If the CPU
> goes down, same thing.
> 
> I am trying to come up with a design here which I haven't found so far.
> But I might be late to the party and everyone else is fully aware.
> 
> > If managed_irq were solely relied upon, the IRQ subsystem would
> > successfully keep hardware interrupts off the isolated CPUs, but the block
> 
> The managed_irqs can't be influence by userland. The CPUs are auto
> distributed.
> 
> > layer would still blindly map polling queues or non-managed queues to those
> > same isolated CPUs. This would force isolated CPUs to process I/O
> > submissions or handle polling tasks, thereby breaking the strict isolation.
> > 
> > Regarding the point about the networking subsystem, it is a very valid
> > comparison. If the networking layer wishes to respect isolcpus in the
> > future, adding a net flag would indeed exacerbate the bit proliferation.
> 
> Networking could also have different cases like adding a RX filter and
> having HW putting packet based on it in a dedicated queue. But also in
> this case I would like to have the freedome to decide which isolated
> CPUs should receive interrupts/ traffic and which don't.
> 
> > For the present time, retaining io_queue seems the most prudent approach to
> > ensure that block queue mapping remains semantically distinct from
> > interrupt delivery. This provides an immediate and clean architectural
> > boundary. However, if the consensus amongst the maintainers suggests that
> > this is too granular, alternative approaches could certainly be considered
> > for the future. For instance, a broader, more generic flag could be
> > introduced to encompass both block and future networking queue mappings.
> > Alternatively, if semantic conflation is deemed acceptable, the existing
> > managed_irq housekeeping mask could simply be overloaded within the block
> > layer to restrict all queue mappings.
> > 
> > Keeping the current separation appears to be the cleanest solution for this
> > series, but your thoughts, and those of the wider community, on potentially
> > migrating to a consolidated generic flag in the future would be very much
> > welcomed.
> 
> I just don't like introducing yet another boot argument, making it a
> boot constraint while in my naive view this could be managed at some
> degree via sysfs as suggested above.

Hi Sebastian,

I believe it would be more prudent to defer to Thomas Gleixner and Jens
Axboe on this matter.


Indeed, I am entirely sympathetic to your reluctance to introduce yet
another boot parameter, and I concur that run-time configurability
represents the ideal scenario for system tuning.

At present, a device such as an NVMe controller allocates its hardware
queues and requests its interrupt vectors during the initial device probe
phase. The block layer calculates the optimal queue to CPU mapping based on
the system topology at that precise moment. Altering this mapping
dynamically at runtime via sysfs would be an exceptionally intricate
undertaking. It would necessitate freezing all active operations, tearing
down the physical hardware queues on the device, renegotiating the
interrupt vectors with the peripheral component interconnect subsystem, and
finally reconstructing the entire queue map.

Furthermore, the proposed io_queue boot parameter successfully achieves the
objective of avoiding driver level modifications. By applying the
housekeeping mask constraint centrally within the core block layer mapping
helpers, all multiqueue drivers automatically inherit the CPU isolation
boundaries without requiring a single line of code to be changed within the
individual drivers themselves.

Because the hardware queue count and CPU alignment must be calculated as
the device initialises, a reliable mechanism is required to inform the
block layer of which CPUs are strictly isolated before the probe sequence
commences. This is precisely why integrating with the existing boot time
housekeeping infrastructure is currently the most viable and robust
solution.

Whilst a fully dynamic sysfs driven reconfiguration architecture would be a
great, it would represent a substantial paradigm shift for the block layer.
For the present time, the io_queue flag resolves the immediate and severe
latency issues experienced by users with isolated CPUs, employing an
established and safe methodology.

This is at least my understanding.


Kind regards,
-- 
Aaron Tomlin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
  2026-04-03  0:50         ` Aaron Tomlin
@ 2026-04-03  1:20           ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2026-04-03  1:20 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: Sebastian Andrzej Siewior, axboe, kbusch, hch, sagi, mst, aacraid,
	James.Bottomley, martin.petersen, liyihang9, kashyap.desai,
	sumit.saxena, shivasharan.srikanteshwara, chandrakanth.patil,
	sathya.prakash, sreekanth.reddy, suganath-prabu.subramani,
	ranjan.kumar, jinpu.wang, tglx, mingo, peterz, juri.lelli,
	vincent.guittot, akpm, maz, ruanjinjie, yphbchou0911, wagi,
	frederic, longman, chenridong, hare, kch, steve, sean, chjohnst,
	neelx, mproche, linux-block, linux-kernel, virtualization,
	linux-nvme, linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Thu, Apr 02, 2026 at 08:50:55PM -0400, Aaron Tomlin wrote:
> On Thu, Apr 02, 2026 at 11:09:40AM +0200, Sebastian Andrzej Siewior wrote:
> > On 2026-04-01 16:58:22 [-0400], Aaron Tomlin wrote:
> > > Hi Sebastian,
> > Hi,
> > 
> > > Thank you for taking the time to document the "managed_irq" behaviour; it
> > > is immensely helpful. You raise a highly pertinent point regarding the
> > > potential proliferation of "isolcpus=" flags. It is certainly a situation
> > > that must be managed carefully to prevent every subsystem from demanding
> > > its own bit.
> > > 
> > > To clarify the reasoning behind introducing "io_queue" rather than strictly
> > > relying on managed_irq:
> > > 
> > > The managed_irq flag belongs firmly to the interrupt subsystem. It dictates
> > > whether a CPU is eligible to receive hardware interrupts whose affinity is
> > > managed by the kernel. Whilst many modern block drivers use managed IRQs,
> > > the block layer multi-queue mapping encompasses far more than just
> > > interrupt routing. It maps logical queues to CPUs to handle I/O submission,
> > > software queues, and crucially, poll queues, which do not utilise
> > > interrupts at all. Furthermore, there are specific drivers that do not use
> > > the managed IRQ infrastructure but still rely on the block layer for queue
> > > distribution.
> > 
> > Could you tell block which queue maps to which CPU at /sys/block/$$/mq/
> > level? Then you have one queue going to one CPU.
> > Then the drive could request one or more interrupts managed or not. For
> > managed you could specify a CPU mask which you desire to occupy.
> > You have the case where
> > - you have more queues than CPUs
> >   - use all of them
> >   - use less
> > - less queues than CPUs
> >   - mapped a queue to more than once CPU in case it goes down or becomes
> >     not available
> >   - mapped to one CPU
> > 
> > Ideally you solve this at one level so that the device(s) can request
> > less queues than CPUs if told so without patching each and every driver.
> > 
> > This should give you the freedom to isolate CPUs, decide at boot time
> > which CPUs get I/O queues assigned. At run time you can tell which
> > queues go to which CPUs. If you shutdown a queue, the interrupt remains
> > but does not get any I/O requests assigned so no problem. If the CPU
> > goes down, same thing.
> > 
> > I am trying to come up with a design here which I haven't found so far.
> > But I might be late to the party and everyone else is fully aware.
> > 
> > > If managed_irq were solely relied upon, the IRQ subsystem would
> > > successfully keep hardware interrupts off the isolated CPUs, but the block
> > 
> > The managed_irqs can't be influence by userland. The CPUs are auto
> > distributed.
> > 
> > > layer would still blindly map polling queues or non-managed queues to those
> > > same isolated CPUs. This would force isolated CPUs to process I/O
> > > submissions or handle polling tasks, thereby breaking the strict isolation.
> > > 
> > > Regarding the point about the networking subsystem, it is a very valid
> > > comparison. If the networking layer wishes to respect isolcpus in the
> > > future, adding a net flag would indeed exacerbate the bit proliferation.
> > 
> > Networking could also have different cases like adding a RX filter and
> > having HW putting packet based on it in a dedicated queue. But also in
> > this case I would like to have the freedome to decide which isolated
> > CPUs should receive interrupts/ traffic and which don't.
> > 
> > > For the present time, retaining io_queue seems the most prudent approach to
> > > ensure that block queue mapping remains semantically distinct from
> > > interrupt delivery. This provides an immediate and clean architectural
> > > boundary. However, if the consensus amongst the maintainers suggests that
> > > this is too granular, alternative approaches could certainly be considered
> > > for the future. For instance, a broader, more generic flag could be
> > > introduced to encompass both block and future networking queue mappings.
> > > Alternatively, if semantic conflation is deemed acceptable, the existing
> > > managed_irq housekeeping mask could simply be overloaded within the block
> > > layer to restrict all queue mappings.
> > > 
> > > Keeping the current separation appears to be the cleanest solution for this
> > > series, but your thoughts, and those of the wider community, on potentially
> > > migrating to a consolidated generic flag in the future would be very much
> > > welcomed.
> > 
> > I just don't like introducing yet another boot argument, making it a
> > boot constraint while in my naive view this could be managed at some
> > degree via sysfs as suggested above.
> 
> Hi Sebastian,
> 
> I believe it would be more prudent to defer to Thomas Gleixner and Jens
> Axboe on this matter.
> 
> 
> Indeed, I am entirely sympathetic to your reluctance to introduce yet
> another boot parameter, and I concur that run-time configurability
> represents the ideal scenario for system tuning.

`io_queue` introduces cost of potential failure on offlining CPU, so how
can it replace the existing `managed_irq`?

> 
> At present, a device such as an NVMe controller allocates its hardware
> queues and requests its interrupt vectors during the initial device probe
> phase. The block layer calculates the optimal queue to CPU mapping based on
> the system topology at that precise moment. Altering this mapping
> dynamically at runtime via sysfs would be an exceptionally intricate
> undertaking. It would necessitate freezing all active operations, tearing
> down the physical hardware queues on the device, renegotiating the
> interrupt vectors with the peripheral component interconnect subsystem, and
> finally reconstructing the entire queue map.
> 
> Furthermore, the proposed io_queue boot parameter successfully achieves the
> objective of avoiding driver level modifications. By applying the
> housekeeping mask constraint centrally within the core block layer mapping
> helpers, all multiqueue drivers automatically inherit the CPU isolation
> boundaries without requiring a single line of code to be changed within the
> individual drivers themselves.
> 
> Because the hardware queue count and CPU alignment must be calculated as
> the device initialises, a reliable mechanism is required to inform the
> block layer of which CPUs are strictly isolated before the probe sequence
> commences. This is precisely why integrating with the existing boot time
> housekeeping infrastructure is currently the most viable and robust
> solution.
> 
> Whilst a fully dynamic sysfs driven reconfiguration architecture would be a
> great, it would represent a substantial paradigm shift for the block layer.
> For the present time, the io_queue flag resolves the immediate and severe
> latency issues experienced by users with isolated CPUs, employing an
> established and safe methodology.

I'd suggest to document the exact existing problem, cause `managed_irq`
should cover it in try-best way, so people can know how to select the two
parameters.


Thanks,
Ming



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (8 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-31 23:05   ` Keith Busch
  2026-04-03  1:43   ` Ming Lei
  2026-03-30 22:10 ` [PATCH v9 11/13] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Aaron Tomlin
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

Extend the capabilities of the generic CPU to hardware queue (hctx)
mapping code, so it maps houskeeping CPUs and isolated CPUs to the
hardware queues evenly.

A hctx is only operational when there is at least one online
housekeeping CPU assigned (aka active_hctx). Thus, check the final
mapping that there is no hctx which has only offline housekeeing CPU and
online isolated CPUs.

Example mapping result:

  16 online CPUs

  isolcpus=io_queue,2-3,6-7,12-13

Queue mapping:
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7
        hctx4: default 8 12
        hctx5: default 9 13
        hctx6: default 10
        hctx7: default 11
        hctx8: default 14
        hctx9: default 15

IRQ mapping:
        irq 42 affinity 0 effective 0  nvme0q0
        irq 43 affinity 0 effective 0  nvme0q1
        irq 44 affinity 1 effective 1  nvme0q2
        irq 45 affinity 4 effective 4  nvme0q3
        irq 46 affinity 5 effective 5  nvme0q4
        irq 47 affinity 8 effective 8  nvme0q5
        irq 48 affinity 9 effective 9  nvme0q6
        irq 49 affinity 10 effective 10  nvme0q7
        irq 50 affinity 11 effective 11  nvme0q8
        irq 51 affinity 14 effective 14  nvme0q9
        irq 52 affinity 15 effective 15  nvme0q10

A corner case is when the number of online CPUs and present CPUs
differ and the driver asks for less queues than online CPUs, e.g.

  8 online CPUs, 16 possible CPUs

  isolcpus=io_queue,2-3,6-7,12-13
  virtio_blk.num_request_queues=2

Queue mapping:
        hctx0: default 0 1 2 3 4 5 6 7 8 12 13
        hctx1: default 9 10 11 14 15

IRQ mapping
        irq 27 affinity 0 effective 0 virtio0-config
        irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
        irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1

Noteworthy is that for the normal/default configuration (!isoclpus) the
mapping will change for systems which have non hyperthreading CPUs. The
main assignment loop will completely rely that group_mask_cpus_evenly to
do the right thing. The old code would distribute the CPUs linearly over
the hardware context:

queue mapping for /dev/nvme0n1
        hctx0: default 0 8
        hctx1: default 1 9
        hctx2: default 2 10
        hctx3: default 3 11
        hctx4: default 4 12
        hctx5: default 5 13
        hctx6: default 6 14
        hctx7: default 7 15

The assign each hardware context the map generated by the
group_mask_cpus_evenly function:

queue mapping for /dev/nvme0n1
        hctx0: default 0 1
        hctx1: default 2 3
        hctx2: default 4 5
        hctx3: default 6 7
        hctx4: default 8 9
        hctx5: default 10 11
        hctx6: default 12 13
        hctx7: default 14 15

In case of hyperthreading CPUs, the resulting map stays the same.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq-cpumap.c | 177 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 158 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8244ecf87835..3b4fa3b291c9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -22,7 +22,18 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
 {
 	unsigned int num;
 
-	num = cpumask_weight(mask);
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		const struct cpumask *hk_mask;
+		struct cpumask avail_mask;
+
+		hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+		cpumask_and(&avail_mask, mask, hk_mask);
+
+		num = cpumask_weight(&avail_mask);
+	} else {
+		num = cpumask_weight(mask);
+	}
+
 	return min_not_zero(num, max_queues);
 }
 
@@ -31,9 +42,13 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
  *
  * Returns an affinity mask that represents the queue-to-CPU mapping
  * requested by the block layer based on possible CPUs.
+ * This helper takes isolcpus settings into account.
  */
 const struct cpumask *blk_mq_possible_queue_affinity(void)
 {
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
 	return cpu_possible_mask;
 }
 EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
@@ -46,6 +61,14 @@ EXPORT_SYMBOL_GPL(blk_mq_possible_queue_affinity);
  */
 const struct cpumask *blk_mq_online_queue_affinity(void)
 {
+	/*
+	 * Return the stable housekeeping mask if enabled. Callers (e.g.,
+	 * the IRQ affinity core) are responsible for safely intersecting
+	 * this with a local snapshot of the online mask.
+	 */
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
 	return cpu_online_mask;
 }
 EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
@@ -57,7 +80,8 @@ EXPORT_SYMBOL_GPL(blk_mq_online_queue_affinity);
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of possible CPUs.
+ * device based on the number of possible CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
 {
@@ -72,7 +96,8 @@ EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of online CPUs.
+ * device based on the number of online CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 {
@@ -80,23 +105,104 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
 
+static bool blk_mq_validate(struct blk_mq_queue_map *qmap,
+			    const struct cpumask *active_hctx)
+{
+	/*
+	 * Verify if the mapping is usable when housekeeping
+	 * configuration is enabled
+	 */
+
+	for (int queue = 0; queue < qmap->nr_queues; queue++) {
+		int cpu;
+
+		if (cpumask_test_cpu(queue, active_hctx)) {
+			/*
+			 * This htcx has at least one online CPU thus it
+			 * is able to serve any assigned isolated CPU.
+			 */
+			continue;
+		}
+
+		/*
+		 * There is no housekeeping online CPU for this hctx, all
+		 * good as long as all non houskeeping CPUs are also
+		 * offline.
+		 */
+		for_each_online_cpu(cpu) {
+			if (qmap->mq_map[cpu] != queue)
+				continue;
+
+			pr_warn("Unable to create a usable CPU-to-queue mapping with the given constraints\n");
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static void blk_mq_map_fallback(struct blk_mq_queue_map *qmap)
+{
+	unsigned int cpu;
+
+	/*
+	 * Map all CPUs to the first hctx to ensure at least one online
+	 * CPU is serving it.
+	 */
+	for_each_possible_cpu(cpu)
+		qmap->mq_map[cpu] = 0;
+}
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
-	const struct cpumask *masks;
+	struct cpumask *masks __free(kfree) = NULL;
+	const struct cpumask *constraint;
 	unsigned int queue, cpu, nr_masks;
+	cpumask_var_t active_hctx;
 
-	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
-	if (!masks) {
-		for_each_possible_cpu(cpu)
-			qmap->mq_map[cpu] = qmap->queue_offset;
-		return;
-	}
+	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
+		goto fallback;
+
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	else
+		constraint = cpu_possible_mask;
+
+	/* Map CPUs to the hardware contexts (hctx) */
+	masks = group_mask_cpus_evenly(qmap->nr_queues, constraint, &nr_masks);
+	if (!masks)
+		goto free_fallback;
 
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		for_each_cpu(cpu, &masks[queue % nr_masks])
-			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
+
+		for_each_cpu(cpu, &masks[idx]) {
+			qmap->mq_map[cpu] = idx;
+
+			if (cpu_online(cpu))
+				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
+		}
+	}
+
+	/* Map any unassigned CPU evenly to the hardware contexts (hctx) */
+	queue = cpumask_first(active_hctx);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, constraint) {
+		qmap->mq_map[cpu] = (qmap->queue_offset + queue) % nr_masks;
+		queue = cpumask_next_wrap(queue, active_hctx);
 	}
-	kfree(masks);
+
+	if (!blk_mq_validate(qmap, active_hctx))
+		goto free_fallback;
+
+	free_cpumask_var(active_hctx);
+
+	return;
+
+free_fallback:
+	free_cpumask_var(active_hctx);
+
+fallback:
+	blk_mq_map_fallback(qmap);
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
 
@@ -133,24 +239,57 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 			  struct device *dev, unsigned int offset)
 
 {
-	const struct cpumask *mask;
+	cpumask_var_t active_hctx, mask;
 	unsigned int queue, cpu;
 
 	if (!dev->bus->irq_get_affinity)
 		goto fallback;
 
+	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
+		goto fallback;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+		free_cpumask_var(active_hctx);
+		goto fallback;
+	}
+
+	/* Map CPUs to the hardware contexts (hctx) */
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		mask = dev->bus->irq_get_affinity(dev, queue + offset);
-		if (!mask)
-			goto fallback;
+		const struct cpumask *affinity_mask;
+
+		affinity_mask = dev->bus->irq_get_affinity(dev, offset + queue);
+		if (!affinity_mask)
+			goto free_fallback;
 
-		for_each_cpu(cpu, mask)
+		for_each_cpu(cpu, affinity_mask) {
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+
+			cpumask_set_cpu(cpu, mask);
+			if (cpu_online(cpu))
+				cpumask_set_cpu(qmap->mq_map[cpu], active_hctx);
+		}
 	}
 
+	/* Map any unassigned CPU evenly to the hardware contexts (hctx) */
+	queue = cpumask_first(active_hctx);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = cpumask_next_wrap(queue, active_hctx);
+	}
+
+	if (!blk_mq_validate(qmap, active_hctx))
+		goto free_fallback;
+
+	free_cpumask_var(active_hctx);
+	free_cpumask_var(mask);
+
 	return;
 
+free_fallback:
+	free_cpumask_var(active_hctx);
+	free_cpumask_var(mask);
+
 fallback:
-	blk_mq_map_queues(qmap);
+	blk_mq_map_fallback(qmap);
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_hw_queues);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2026-03-30 22:10 ` [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
@ 2026-03-31 23:05   ` Keith Busch
  2026-04-01 17:16     ` Aaron Tomlin
  2026-04-03  1:43   ` Ming Lei
  1 sibling, 1 reply; 29+ messages in thread
From: Keith Busch @ 2026-03-31 23:05 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: axboe, hch, sagi, mst, aacraid, James.Bottomley, martin.petersen,
	liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
	longman, chenridong, hare, kch, ming.lei, steve, sean, chjohnst,
	neelx, mproche, linux-block, linux-kernel, virtualization,
	linux-nvme, linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Mon, Mar 30, 2026 at 06:10:44PM -0400, Aaron Tomlin wrote:
> +static bool blk_mq_validate(struct blk_mq_queue_map *qmap,
> +			    const struct cpumask *active_hctx)
> +{
> +	/*
> +	 * Verify if the mapping is usable when housekeeping
> +	 * configuration is enabled
> +	 */
> +
> +	for (int queue = 0; queue < qmap->nr_queues; queue++) {
> +		int cpu;
> +
> +		if (cpumask_test_cpu(queue, active_hctx)) {
> +			/*
> +			 * This htcx has at least one online CPU thus it

Typo, should say "hctx".

> +			 * is able to serve any assigned isolated CPU.
> +			 */
> +			continue;
> +		}
> +
> +		/*
> +		 * There is no housekeeping online CPU for this hctx, all
> +		 * good as long as all non houskeeping CPUs are also

Typo, "housekeeping".

...

>  void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
>  {
> -	const struct cpumask *masks;
> +	struct cpumask *masks __free(kfree) = NULL;
> +	const struct cpumask *constraint;
>  	unsigned int queue, cpu, nr_masks;
> +	cpumask_var_t active_hctx;
>  
> -	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
> -	if (!masks) {
> -		for_each_possible_cpu(cpu)
> -			qmap->mq_map[cpu] = qmap->queue_offset;
> -		return;
> -	}
> +	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
> +		goto fallback;
> +
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
> +		constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +	else
> +		constraint = cpu_possible_mask;
> +
> +	/* Map CPUs to the hardware contexts (hctx) */
> +	masks = group_mask_cpus_evenly(qmap->nr_queues, constraint, &nr_masks);
> +	if (!masks)
> +		goto free_fallback;
>  
>  	for (queue = 0; queue < qmap->nr_queues; queue++) {
> -		for_each_cpu(cpu, &masks[queue % nr_masks])
> -			qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +		unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
> +
> +		for_each_cpu(cpu, &masks[idx]) {
> +			qmap->mq_map[cpu] = idx;

I think there's something off with this when we have multiple queue maps. The
wrapping loses the offset when we've isolated CPUs, so I think the index would
end up incorrect.

Trying this series out when "nvme.poll_queues=2" with isolcpus set, I am
getting a kernel panic:

 nvme nvme0: 8/0/2 default/read/poll queues
 BUG: unable to handle page fault for address: ffff889101898da0
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 4e01067 P4D 4e01067 PUD 0
 Oops: Oops: 0000 [#1] SMP PTI
 CPU: 11 UID: 0 PID: 201 Comm: kworker/u64:19 Not tainted 7.0.0-rc4-00222-g065cad526374 #1586 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
 Workqueue: async async_run_entry_fn
 RIP: 0010:nvme_init_hctx_common+0x6f/0x190 [nvme]
 Code: 85 78 01 00 00 0f 85 86 00 00 00 45 8b b5 88 01 00 00 4c 89 f0 4d 89 f1 48 c1 e0 04 49 89 c7 4c 8d 94 03 38 0b 00 00 49 01 df <49> 83 bf 40 0b 00 00 00 74 64 44 89 d0 49 81 fa 00 f0 ff ff 77 27
 RSP: 0018:ffffc9000083ba90 EFLAGS: 00010286
 RAX: 0000000ffffffff0 RBX: ffff888101898270 RCX: ffffffffa008bd40
 RDX: 0000000000000008 RSI: ffff888101898270 RDI: ffff888101900800
 RBP: ffffc9000083bac8 R08: 0000000000000060 R09: 00000000ffffffff
 R10: ffff889101898d98 R11: ffff888101ddf000 R12: ffff8881087f36c0
 R13: ffff888101900800 R14: 00000000ffffffff R15: ffff889101898260
 FS:  0000000000000000(0000) GS:ffff8890bb50a000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffff889101898da0 CR3: 0000000101fe8001 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  blk_mq_alloc_and_init_hctx+0x11e/0x3a0
  __blk_mq_realloc_hw_ctxs+0x185/0x220
  blk_mq_init_allocated_queue+0xeb/0x3b0
  ? percpu_ref_init+0x6a/0x130
  blk_mq_alloc_queue+0x7a/0xd0
  __blk_mq_alloc_disk+0x14/0x60
  nvme_alloc_ns+0xac/0xb30 [nvme_core]
  ? blk_mq_run_hw_queue+0x117/0x270
  nvme_scan_ns+0x279/0x350 [nvme_core]
  async_run_entry_fn+0x2e/0x130
  process_one_work+0x16c/0x3a0
  worker_thread+0x173/0x2e0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xe0/0x120
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x207/0x270
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2026-03-31 23:05   ` Keith Busch
@ 2026-04-01 17:16     ` Aaron Tomlin
  0 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-04-01 17:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: axboe, hch, sagi, mst, aacraid, James.Bottomley, martin.petersen,
	liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
	longman, chenridong, hare, kch, ming.lei, steve, sean, chjohnst,
	neelx, mproche, linux-block, linux-kernel, virtualization,
	linux-nvme, linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

[-- Attachment #1: Type: text/plain, Size: 3208 bytes --]

On Tue, Mar 31, 2026 at 05:05:07PM -0600, Keith Busch wrote:
> > +static bool blk_mq_validate(struct blk_mq_queue_map *qmap,
> > +			    const struct cpumask *active_hctx)
> > +{
> > +	/*
> > +	 * Verify if the mapping is usable when housekeeping
> > +	 * configuration is enabled
> > +	 */
> > +
> > +	for (int queue = 0; queue < qmap->nr_queues; queue++) {
> > +		int cpu;

Hi Keith,

Thank you for taking the time to review and test.

> > +
> > +		if (cpumask_test_cpu(queue, active_hctx)) {
> > +			/*
> > +			 * This htcx has at least one online CPU thus it
> 
> Typo, should say "hctx".

Acknowledged.

> > +			 * is able to serve any assigned isolated CPU.
> > +			 */
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * There is no housekeeping online CPU for this hctx, all
> > +		 * good as long as all non houskeeping CPUs are also
> 
> Typo, "housekeeping".

Acknowledged.

> ...
> 
> >  void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
> >  {
> > -	const struct cpumask *masks;
> > +	struct cpumask *masks __free(kfree) = NULL;
> > +	const struct cpumask *constraint;
> >  	unsigned int queue, cpu, nr_masks;
> > +	cpumask_var_t active_hctx;
> >  
> > -	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
> > -	if (!masks) {
> > -		for_each_possible_cpu(cpu)
> > -			qmap->mq_map[cpu] = qmap->queue_offset;
> > -		return;
> > -	}
> > +	if (!zalloc_cpumask_var(&active_hctx, GFP_KERNEL))
> > +		goto fallback;
> > +
> > +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
> > +		constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> > +	else
> > +		constraint = cpu_possible_mask;
> > +
> > +	/* Map CPUs to the hardware contexts (hctx) */
> > +	masks = group_mask_cpus_evenly(qmap->nr_queues, constraint, &nr_masks);
> > +	if (!masks)
> > +		goto free_fallback;
> >  
> >  	for (queue = 0; queue < qmap->nr_queues; queue++) {
> > -		for_each_cpu(cpu, &masks[queue % nr_masks])
> > -			qmap->mq_map[cpu] = qmap->queue_offset + queue;
> > +		unsigned int idx = (qmap->queue_offset + queue) % nr_masks;
> > +
> > +		for_each_cpu(cpu, &masks[idx]) {
> > +			qmap->mq_map[cpu] = idx;
> 
> I think there's something off with this when we have multiple queue maps. The
> wrapping loses the offset when we've isolated CPUs, so I think the index would
> end up incorrect.
> 
> Trying this series out when "nvme.poll_queues=2" with isolcpus set, I am
> getting a kernel panic:

To resolve this comprehensively, I have made the following adjustments:

    1.  qmap->mq_map now strictly stores the absolute hardware index
        (qmap->queue_offset + queue) across both the primary assignment
        loop and the unassigned CPU fallback loop.

    2.  The active_hctx cpumask now strictly tracks the relative index
        (queue) for the current map being processed, preventing
        out-of-bounds mask tracking.

    3.  The validation check within blk_mq_validate() has been updated to
        correctly compare the absolute index stored in mq_map against the
        offset-adjusted queue index.

I shall fold these corrections directly into the next series.


Best regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2026-03-30 22:10 ` [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
  2026-03-31 23:05   ` Keith Busch
@ 2026-04-03  1:43   ` Ming Lei
  1 sibling, 0 replies; 29+ messages in thread
From: Ming Lei @ 2026-04-03  1:43 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
	longman, chenridong, hare, kch, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Mon, Mar 30, 2026 at 06:10:44PM -0400, Aaron Tomlin wrote:
> From: Daniel Wagner <wagi@kernel.org>
> 
> Extend the capabilities of the generic CPU to hardware queue (hctx)
> mapping code, so it maps houskeeping CPUs and isolated CPUs to the
> hardware queues evenly.
> 
> A hctx is only operational when there is at least one online
> housekeeping CPU assigned (aka active_hctx). Thus, check the final
> mapping that there is no hctx which has only offline housekeeing CPU and
> online isolated CPUs.
> 
> Example mapping result:
> 
>   16 online CPUs
> 
>   isolcpus=io_queue,2-3,6-7,12-13
> 
> Queue mapping:
>         hctx0: default 0 2
>         hctx1: default 1 3
>         hctx2: default 4 6
>         hctx3: default 5 7
>         hctx4: default 8 12
>         hctx5: default 9 13
>         hctx6: default 10
>         hctx7: default 11
>         hctx8: default 14
>         hctx9: default 15
> 
> IRQ mapping:
>         irq 42 affinity 0 effective 0  nvme0q0
>         irq 43 affinity 0 effective 0  nvme0q1
>         irq 44 affinity 1 effective 1  nvme0q2
>         irq 45 affinity 4 effective 4  nvme0q3
>         irq 46 affinity 5 effective 5  nvme0q4
>         irq 47 affinity 8 effective 8  nvme0q5
>         irq 48 affinity 9 effective 9  nvme0q6
>         irq 49 affinity 10 effective 10  nvme0q7
>         irq 50 affinity 11 effective 11  nvme0q8
>         irq 51 affinity 14 effective 14  nvme0q9
>         irq 52 affinity 15 effective 15  nvme0q10
> 
> A corner case is when the number of online CPUs and present CPUs
> differ and the driver asks for less queues than online CPUs, e.g.
> 
>   8 online CPUs, 16 possible CPUs
> 
>   isolcpus=io_queue,2-3,6-7,12-13
>   virtio_blk.num_request_queues=2
> 
> Queue mapping:
>         hctx0: default 0 1 2 3 4 5 6 7 8 12 13
>         hctx1: default 9 10 11 14 15
> 
> IRQ mapping
>         irq 27 affinity 0 effective 0 virtio0-config
>         irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
>         irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1
> 
> Noteworthy is that for the normal/default configuration (!isoclpus) the
> mapping will change for systems which have non hyperthreading CPUs. The
> main assignment loop will completely rely that group_mask_cpus_evenly to
> do the right thing. The old code would distribute the CPUs linearly over
> the hardware context:
> 
> queue mapping for /dev/nvme0n1
>         hctx0: default 0 8
>         hctx1: default 1 9
>         hctx2: default 2 10
>         hctx3: default 3 11
>         hctx4: default 4 12
>         hctx5: default 5 13
>         hctx6: default 6 14
>         hctx7: default 7 15
> 
> The assign each hardware context the map generated by the
> group_mask_cpus_evenly function:
> 
> queue mapping for /dev/nvme0n1
>         hctx0: default 0 1
>         hctx1: default 2 3
>         hctx2: default 4 5
>         hctx3: default 6 7
>         hctx4: default 8 9
>         hctx5: default 10 11
>         hctx6: default 12 13
>         hctx7: default 14 15
> 
> In case of hyperthreading CPUs, the resulting map stays the same.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>  block/blk-mq-cpumap.c | 177 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 158 insertions(+), 19 deletions(-)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 8244ecf87835..3b4fa3b291c9 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -22,7 +22,18 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
>  {
>  	unsigned int num;
>  
> -	num = cpumask_weight(mask);
> +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> +		const struct cpumask *hk_mask;
> +		struct cpumask avail_mask;

This may overflow kernel stack.


Thanks,
Ming



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v9 11/13] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (9 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 12/13] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs Aaron Tomlin
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

When isolcpus=io_queue is enabled, and the last housekeeping CPU for a
given hctx goes offline, there would be no CPU left to handle I/O. To
prevent I/O stalls, prevent offlining housekeeping CPUs that are still
serving isolated CPUs.

When isolcpus=io_queue is enabled and the last housekeeping CPU
for a given hctx goes offline, no CPU would be left to handle I/O.
To prevent I/O stalls, disallow offlining housekeeping CPUs that are
still serving isolated CPUs.

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 block/blk-mq.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3da2215b2912..8671f2170880 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3699,6 +3699,43 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	return data.has_rq;
 }
 
+static bool blk_mq_hctx_can_offline_hk_cpu(struct blk_mq_hw_ctx *hctx,
+					   unsigned int this_cpu)
+{
+	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
+	for (int i = 0; i < hctx->nr_ctx; i++) {
+		struct blk_mq_ctx *ctx = hctx->ctxs[i];
+
+		if (ctx->cpu == this_cpu)
+			continue;
+
+		/*
+		 * Check if this context has at least one online
+		 * housekeeping CPU; in this case the hardware context is
+		 * usable.
+		 */
+		if (cpumask_test_cpu(ctx->cpu, hk_mask) &&
+		    cpu_online(ctx->cpu))
+			break;
+
+		/*
+		 * The context doesn't have any online housekeeping CPUs,
+		 * but there might be an online isolated CPU mapped to
+		 * it.
+		 */
+		if (cpu_is_offline(ctx->cpu))
+			continue;
+
+		pr_warn("%s: trying to offline hctx%d but there is still an online isolcpu CPU %d mapped to it\n",
+			hctx->queue->disk->disk_name,
+			hctx->queue_num, ctx->cpu);
+		return false;
+	}
+
+	return true;
+}
+
 static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int this_cpu)
 {
@@ -3731,6 +3768,11 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 			struct blk_mq_hw_ctx, cpuhp_online);
 	int ret = 0;
 
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		if (!blk_mq_hctx_can_offline_hk_cpu(hctx, cpu))
+			return -EINVAL;
+	}
+
 	if (!hctx->nr_ctx || blk_mq_hctx_has_online_cpu(hctx, cpu))
 		return 0;
 
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 12/13] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (10 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 11/13] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-30 22:10 ` [PATCH v9 13/13] docs: add io_queue flag to isolcpus Aaron Tomlin
  2026-03-31  1:01 ` [PATCH v9 00/13] blk: honor isolcpus configuration Ming Lei
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

At present, the managed interrupt spreading algorithm distributes vectors
across all available CPUs within a given node or system. On systems
employing CPU isolation (e.g., "isolcpus=io_queue"), this behaviour
defeats the primary purpose of isolation by routing hardware interrupts
(such as NVMe completion queues) directly to isolated cores.

Update irq_create_affinity_masks() to respect the housekeeping CPU mask.
Introduce irq_spread_hk_filter() to intersect the natively calculated
affinity mask with the HK_TYPE_IO_QUEUE mask, thereby keeping managed
interrupts off isolated CPUs.

To ensure strict isolation whilst guaranteeing a valid routing destination:

    1.  Fallback mechanism: Should the initial spreading logic assign a
        vector exclusively to isolated CPUs (resulting in an empty
        intersection), the filter safely falls back to the system's
        online housekeeping CPUs.

    2.  Hotplug safety: The fallback utilises data_race(cpu_online_mask)
        instead of allocating a local cpumask snapshot. This circumvents
        CONFIG_CPUMASK_OFFSTACK stack bloat hazards on high-core-count
        systems. Furthermore, it prevents deadlocks with concurrent CPU
        hotplug operations (e.g., during storage driver error recovery)
        by eliminating the need to hold the CPU hotplug read lock.

    3.  Fast-path optimisation: The filtering logic is conditionally
        executed only if housekeeping is enabled, thereby ensuring zero
        overhead for standard configurations.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 kernel/irq/affinity.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 076a5ef1e306..dd9e7f5fbdec 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -8,6 +8,24 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
+
+/**
+ * irq_spread_hk_filter - Restrict an interrupt affinity mask to housekeeping CPUs
+ * @mask:            The interrupt affinity mask to filter (in/out)
+ * @hk_mask:         The system's housekeeping CPU mask
+ *
+ * Intersects @mask with @hk_mask to keep interrupts off isolated CPUs.
+ * If this intersection is empty (meaning all targeted CPUs were isolated),
+ * it falls back to the online housekeeping CPUs to guarantee a valid
+ * routing destination.
+ */
+static void irq_spread_hk_filter(struct cpumask *mask,
+				 const struct cpumask *hk_mask)
+{
+	if (!cpumask_and(mask, mask, hk_mask))
+		cpumask_and(mask, hk_mask, data_race(cpu_online_mask));
+}
 
 static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
 {
@@ -27,6 +45,8 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 {
 	unsigned int affvecs, curvec, usedvecs, i;
 	struct irq_affinity_desc *masks = NULL;
+	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	bool hk_enabled = housekeeping_enabled(HK_TYPE_IO_QUEUE);
 
 	/*
 	 * Determine the number of vectors which need interrupt affinities
@@ -83,8 +103,12 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 			return NULL;
 		}
 
-		for (int j = 0; j < nr_masks; j++)
+		for (int j = 0; j < nr_masks; j++) {
 			cpumask_copy(&masks[curvec + j].mask, &result[j]);
+			if (hk_enabled)
+				irq_spread_hk_filter(&masks[curvec + j].mask,
+						     hk_mask);
+		}
 		kfree(result);
 
 		curvec += nr_masks;
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v9 13/13] docs: add io_queue flag to isolcpus
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (11 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 12/13] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs Aaron Tomlin
@ 2026-03-30 22:10 ` Aaron Tomlin
  2026-03-31  1:01 ` [PATCH v9 00/13] blk: honor isolcpus configuration Ming Lei
  13 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-30 22:10 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, steve, sean, chjohnst, neelx, mproche, linux-block,
	linux-kernel, virtualization, linux-nvme, linux-scsi,
	megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

From: Daniel Wagner <wagi@kernel.org>

The io_queue flag informs multiqueue device drivers where to place
hardware queues. Document this new flag in the isolcpus
command-line argument description.

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 .../admin-guide/kernel-parameters.txt         | 22 ++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 03a550630644..9ed7c3ecd158 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2816,7 +2816,6 @@ Kernel parameters
 			  "number of CPUs in system - 1".
 
 			managed_irq
-
 			  Isolate from being targeted by managed interrupts
 			  which have an interrupt mask containing isolated
 			  CPUs. The affinity of managed interrupts is
@@ -2839,6 +2838,27 @@ Kernel parameters
 			  housekeeping CPUs has no influence on those
 			  queues.
 
+			io_queue
+			  Isolate from I/O queue work caused by multiqueue
+			  device drivers. Restrict the placement of
+			  queues to housekeeping CPUs only, ensuring that
+			  all I/O work is processed by a housekeeping CPU.
+
+			  The io_queue configuration takes precedence
+			  over managed_irq. When io_queue is used,
+			  managed_irq placement constrains have no
+			  effect.
+
+			  Note: Offlining housekeeping CPUS which serve
+			  isolated CPUs will be rejected. Isolated CPUs
+			  need to be offlined before offlining the
+			  housekeeping CPUs.
+
+			  Note: When an isolated CPU issues an I/O request,
+			  it is forwarded to a housekeeping CPU. This will
+			  trigger a software interrupt on the completion
+			  path.
+
 			The format of <cpu-list> is described above.
 
 	iucv=		[HW,NET]
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 00/13] blk: honor isolcpus configuration
  2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
                   ` (12 preceding siblings ...)
  2026-03-30 22:10 ` [PATCH v9 13/13] docs: add io_queue flag to isolcpus Aaron Tomlin
@ 2026-03-31  1:01 ` Ming Lei
  2026-03-31 13:38   ` Aaron Tomlin
  13 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2026-03-31  1:01 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
	longman, chenridong, hare, kch, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Mon, Mar 30, 2026 at 06:10:34PM -0400, Aaron Tomlin wrote:
> Hi Jens, Keith, Christoph, Sagi, Michael,
> 
> I have decided to drive this series forward on behalf of Daniel Wagner, the
> original author. This iteration addresses the outstanding architectural and
> concurrency concerns raised during the previous review cycle, and the series
> has been rebased on v7.0-rc5-509-g545475aebc2a.
> 
> Building upon prior iterations, this series introduces critical
> architectural refinements to the mapping and affinity spreading algorithms
> to guarantee thread safety and resilience against concurrent CPU-hotplug
> operations. Previously, the block layer relied on a shared global static
> mask (i.e., blk_hk_online_mask), which proved vulnerable to race conditions
> during rapid hotplug events. This vulnerability was recently highlighted by
> the kernel test robot, which encountered a NULL pointer dereference during
> rcutorture (cpuhotplug) stress testing due to concurrent mask modification.

Can you share the bug report link?

IMO, it should be solved as a backportable patch first if possible.


Thanks,
Ming



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 00/13] blk: honor isolcpus configuration
  2026-03-31  1:01 ` [PATCH v9 00/13] blk: honor isolcpus configuration Ming Lei
@ 2026-03-31 13:38   ` Aaron Tomlin
  0 siblings, 0 replies; 29+ messages in thread
From: Aaron Tomlin @ 2026-03-31 13:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
	longman, chenridong, hare, kch, steve, sean, chjohnst, neelx,
	mproche, linux-block, linux-kernel, virtualization, linux-nvme,
	linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

[-- Attachment #1: Type: text/plain, Size: 1177 bytes --]

On Tue, Mar 31, 2026 at 09:01:48AM +0800, Ming Lei wrote:
> Can you share the bug report link?
> 
> IMO, it should be solved as a backportable patch first if possible.

Hi Ming,

Thanks for taking a look at the series.

Here is the link to the kernel test robot's bug report:
https://lore.kernel.org/oe-lkp/202509101342.a803ecaa-lkp@intel.com

Regarding a backportable patch, the NULL pointer dereference is actually
not an existing mainline bug. It was a regression introduced specifically
by the previous iteration of this series, specifically by Patch 10:
"blk-mq: use hk cpus only when isolcpus=io_queue is enabled".

The previously implementation introduced the global static
blk_hk_online_mask variable, which caused the race condition during the
robot's cpuhotplug stress tests. Because that flawed logic was confined to
the unmerged v8 patches and never made it into a mainline release, there is
nothing to backport to stable trees.

Version 9 simply drops that global mask approach entirely in favour of
local data_race(cpu_online_mask) snapshots, fixing the architectural flaw
within the series itself.

Kind regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2026-04-04  4:44 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 01/13] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code Aaron Tomlin
2026-04-01 12:29   ` Sebastian Andrzej Siewior
2026-04-01 19:31     ` Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 03/13] lib/group_cpus: Add group_mask_cpus_evenly() Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 04/13] genirq/affinity: Add cpumask to struct irq_affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 05/13] blk-mq: add blk_mq_{online|possible}_queue_affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 06/13] nvme-pci: use block layer helpers to constrain queue affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 07/13] scsi: Use " Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 08/13] virtio: blk/scsi: use " Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type Aaron Tomlin
2026-04-01 12:49   ` Sebastian Andrzej Siewior
2026-04-01 19:05     ` Waiman Long
2026-04-02  7:58       ` Sebastian Andrzej Siewior
2026-04-03  1:54         ` Waiman Long
2026-04-01 20:58     ` Aaron Tomlin
2026-04-02  9:09       ` Sebastian Andrzej Siewior
2026-04-03  0:50         ` Aaron Tomlin
2026-04-03  1:20           ` Ming Lei
2026-03-30 22:10 ` [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
2026-03-31 23:05   ` Keith Busch
2026-04-01 17:16     ` Aaron Tomlin
2026-04-03  1:43   ` Ming Lei
2026-03-30 22:10 ` [PATCH v9 11/13] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 12/13] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 13/13] docs: add io_queue flag to isolcpus Aaron Tomlin
2026-03-31  1:01 ` [PATCH v9 00/13] blk: honor isolcpus configuration Ming Lei
2026-03-31 13:38   ` Aaron Tomlin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox