Linux block layer
* [PATCH v15 0/8] blk: honor isolcpus configuration
@ 2026-05-21 23:29 Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 1/8] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, tom.leiming, steve, sean, chjohnst, neelx, mproche,
	nick.lange, marco.crivellari, rishil1999, linux-block,
	linux-kernel

Hi,

I have decided to drive this series forward on behalf of Daniel Wagner, the
original author. The series has been rebased on v7.1-rc4-100-g8bc67e4db64a.

This series introduces a new CPU isolation feature, "isolcpus=io_queue",
designed to protect isolated cores from the disruptive hardware interrupts
generated by high-performance multi-queue devices.

When enabled, it fundamentally alters how the generic IRQ subsystem and the
block layer (blk-mq) map hardware queues:

    1.  Restricted IRQ Affinity: Managed hardware interrupts are strictly
        confined to online housekeeping CPUs.

    2.  Transparent I/O Submission: Applications running on isolated CPUs
        can still seamlessly submit I/O requests; however, the resulting
        hardware completion interrupts are safely routed to a designated
        housekeeping CPU.

    3.  Topology-Aware Queue Allocation: The generic CPU-to-hardware-queue
        mapping logic is extended to distribute hardware contexts evenly
        among the available housekeeping CPUs, preventing MSI-X vector
        exhaustion while maintaining optimal cache locality where possible.

To prevent I/O stalls, the block layer is additionally hardened to reject
hot-plug requests that attempt to offline a housekeeping CPU if it is the
last remaining CPU actively serving an online isolated core.
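
As a rough illustration of the offline guard described above, the following
userspace sketch models the decision for a single hardware queue. It is not
the code from this series: the cpuset_t type, the function name
hctx_allows_offline() and the 64-CPU limit are assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; enough for a 64-CPU toy system (assumption). */
typedef uint64_t cpuset_t;

/*
 * Hypothetical model of the offline guard: may @cpu go offline without
 * stranding an online isolated CPU that shares a hardware queue with it?
 *
 * @hctx_cpus: CPUs mapped to one hardware queue
 * @online:    currently online CPUs
 * @hk:        housekeeping CPUs (every other CPU counts as isolated)
 */
static bool hctx_allows_offline(cpuset_t hctx_cpus, cpuset_t online,
				cpuset_t hk, unsigned int cpu)
{
	cpuset_t bit, remaining_hk, online_isolated;

	assert(cpu < 64);
	bit = (cpuset_t)1 << cpu;

	/* A CPU that is not mapped to this queue never blocks it. */
	if (!(hctx_cpus & bit))
		return true;

	/* Offlining an isolated CPU does not remove an interrupt target. */
	if (!(hk & bit))
		return true;

	remaining_hk = hctx_cpus & online & hk & ~bit;
	online_isolated = hctx_cpus & online & ~hk;

	/* Refuse only if isolated CPUs would be left without a hk CPU. */
	return remaining_hk != 0 || online_isolated == 0;
}
```

With a queue serving CPUs {0, 2}, housekeeping CPUs {0, 1} and all four CPUs
online, the sketch refuses to offline CPU 0 while CPU 2 is still online, and
allows it once CPU 2 has gone offline.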

The complex "top-down" mask plumbing introduced in v12, which modified
struct irq_affinity and expanded block layer APIs, has been abandoned. It
is replaced by a centralised approach: direct isolation querying via
housekeeping_cpumask(HK_TYPE_IO_QUEUE) within the genirq/affinity
subsystem. This architectural simplification successfully decouples core
changes from driver-specific implementations.

Please let me know your thoughts.


Changes since v14:

 - Fixed a division-by-zero by ensuring group_mask_cpus_evenly() safely
   frees its allocations and returns NULL instead of an empty array if the
   provided mask yields zero groups.

 - Fixed a device probe -ENOSPC regression in blk_mq_num_queues(). If the
   housekeeping mask intersection evaluated to 0 (e.g., against a localised
   NUMA node), min_not_zero() would erroneously return the absolute maximum
   hardware queues. The result is now safely clamped to a minimum of 1.

 - Added a mapping verification check to prevent unrelated housekeeping
   CPUs from aborting the global hotplug offline sequence.

 - Aligned the pr_warn format specifier with the unsigned int declaration
   of hctx->queue_num in blk_mq_hctx_can_offline_hk_cpu().

 - Linked to v14: https://lore.kernel.org/lkml/20260520215030.496803-1-atomlin@atomlin.com/
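
The min_not_zero() pitfall mentioned above can be seen in a small userspace
model. The helper names below are made up for illustration;
min_not_zero_uint() only mimics the semantics of the kernel macro (return the
smaller operand, ignoring operands that are zero), and the clamp shown is one
plausible way to express "a minimum of 1", not necessarily how the patch
implements it.

```c
/* Userspace stand-in for the semantics of the kernel's min_not_zero(). */
static unsigned int min_not_zero_uint(unsigned int x, unsigned int y)
{
	if (x == 0)
		return y;
	if (y == 0)
		return x;
	return x < y ? x : y;
}

/* Behaviour before the fix: a zero CPU count falls through to the maximum. */
static unsigned int num_queues_unclamped(unsigned int nr_hk_cpus,
					 unsigned int max_queues)
{
	return min_not_zero_uint(nr_hk_cpus, max_queues);
}

/* Behaviour after the fix: the CPU count is treated as at least 1. */
static unsigned int num_queues_clamped(unsigned int nr_hk_cpus,
				       unsigned int max_queues)
{
	unsigned int num = nr_hk_cpus ? nr_hk_cpus : 1;

	return min_not_zero_uint(num, max_queues);
}
```

For an empty intersection and a device limit of 64 queues, the unclamped
variant yields 64 while the clamped variant yields 1.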

Changes since v13:

 - Removed ineffective data_race() annotations around mask and
   cpu_present_mask pointers. Wrapping the pointers failed to suppress
   KCSAN warnings for the underlying inline bitmap memory accesses.

 - Fixed a silent validation bypass in blk_mq_map_hw_queues() caused by
   overlapping IRQ affinity masks by removing the short-circuiting
   optimisation and evaluating the active_hctx bitmap in a secondary pass.

 - Restored topology-aware multi-queue fallback by correctly routing
   missing IRQ affinity masks to the map_software path instead of the naive
   map-all fallback.

 - Dropped hctx->queue->disk->disk_name from warning to avoid a UAF.

 - Fixed an isolation leak where excess allocated hardware queues were
   improperly padded with irq_default_affinity. Because these queues are
   marked as managed, they bypassed user-space IRQ balancing; they are now
   safely padded with the housekeeping mask.

 - Enforced the housekeeping vector cap prior to evaluating driver-provided
   calc_sets() callbacks, preventing modern multi-queue drivers from
   bypassing the cap and wasting memory on dead queues.

 - Introduced a safety net to the vector calculation to prevent fatal
   -ENOSPC device probe aborts on heavily isolated systems where the
   housekeeping CPU count is lower than the device's structural minimum.

 - Removed an inaccurate claim stating that the io_queue isolation flag
   takes precedence over managed_irq. Both flags are parsed, evaluated, and
   enforced entirely independently by their respective subsystems.

 - Linked to v13: https://lore.kernel.org/lkml/20260513005509.135966-1-atomlin@atomlin.com/

Changes since v12:

 - Resolved TOCTOU race conditions against CPU hotplug events in
   blk_mq_map_queues() and group_mask_cpus_evenly() by taking lockless
   snapshots of the online CPU mask prior to algorithmic evaluation.

 - Migrated the active_hctx tracking to a dynamically sized bitmap
   (bitmap_zalloc), resolving a critical out-of-bounds memory write that
   occurred when hardware queues exceeded the system CPU count.

 - Wrapped the disk pointer fetch in blk_mq_hctx_can_offline_hk_cpu() with
   READ_ONCE() to prevent a TOCTOU NULL pointer dereference against
   concurrent device teardowns.

 - Introduced bitmap_empty() checks to prevent the mapping logic from
   routing unassigned CPUs into unallocated memory when all mapped CPUs are
   offline, safely forcing a fallback mapping instead.

 - Implemented a native two-stage distribution logic in
   group_mask_cpus_evenly() that first prioritises physically present CPUs
   to prevent I/O starvation before distributing remaining vectors to
   non-present CPUs for hotplug safety.

 - Restricted the maximum number of allocated vectors in
   irq_calc_affinity_vectors() to the weight of the housekeeping mask,
   preventing drivers from wasting memory on dead hardware queues that
   physically cannot be routed.

 - Added padding logic using irq_default_affinity for sets where isolation
   constraints yield fewer masks than requested vectors, preserving the 1:1
   hardware queue mapping sequence for subsequent sets.

 - Fixed a logic flaw that prematurely rejected valid offline requests by
   manually iterating over cpu_online_mask and reverse-mapping to
   accurately detect isolated CPUs, properly permitting the offlining of
   non-housekeeping CPUs.

 - Corrected an absolute versus relative queue index calculation bug in
   blk_mq_map_queues() that was overwriting loop iterations, by iterating
   directly over the generated masks.

 - Replaced scoped __free cleanups with traditional goto unwinding in the
   block layer to align with subsystem styling guidelines.

 - Refined the io_queue kernel command-line parameter documentation for
   better clarity and precision.

 - Linked to v12: https://lore.kernel.org/lkml/20260422185215.100929-1-atomlin@atomlin.com/

Changes since v11:

 - Removed duplicate paragraph from the commit message in patch 11
   (Marco Crivellari)

 - Ensure ZERO_SIZE_PTR is not returned by group_mask_cpus_evenly()
   (Marco Crivellari)

 - Linked to v11: https://lore.kernel.org/lkml/20260416192942.1243421-1-atomlin@atomlin.com/

Changes since v10:

 - Completely rewrote the isolcpus=io_queue documentation in
   Documentation/admin-guide/kernel-parameters.txt to clarify its exclusive
   application to managed IRQs, queue allocation limits, vector exhaustion
   prevention, and hardware interrupt routing (Ming Lei)

 - Fixed a stack frame bloat issue by avoiding the on-stack declaration of
   struct cpumask (Waiman Long)

 - Linked to v10: https://lore.kernel.org/linux-nvme/20260401222312.772334-1-atomlin@atomlin.com/

Changes since v9:

 - Fixed a page fault regression encountered when initialising secondary
   queue maps (e.g., NVMe poll queues). Restored the qmap->queue_offset to
   the mq_map assignment to ensure CPUs are strictly mapped to absolute
   hardware indices (Keith Busch)

 - Corrected the active_hctx tracker to utilise relative queue indices,
   preventing out-of-bounds mask assignments

 - Fixed the blk_mq_validate() sanity check to properly evaluate absolute
   queue indices against the offset-adjusted loop index

 - Corrected typographical errors within block/blk-mq-cpumap.c
   (Keith Busch)

 - Clarified the commit message regarding the removal of the !SMP fallback
   code, explicitly noting that the core scheduler now mandates SMP
   unconditionally (Sebastian Andrzej Siewior)

 - Added missing "Signed-off-by:" tags to properly record the patch series
   chain of custody

 - Linked to v9: https://lore.kernel.org/lkml/20260330221047.630206-1-atomlin@atomlin.com/

Changes since v8:

 - Added "Reviewed-by:" tags

 - Introduced irq_spread_hk_filter() to safely restrict managed IRQ
   affinity to housekeeping CPUs (Thomas Gleixner)

 - Removed the unsafe global static variable blk_hk_online_mask from
   blk-mq-cpumap.c and blk-mq.c. blk_mq_online_queue_affinity() now returns
   a stable pointer, delegating safe intersection to the callers to prevent
   concurrent modification races (Thomas Gleixner, Hannes Reinecke)

 - Resolved BUG: kernel NULL pointer dereference in __blk_mq_all_tag_iter
   reported by the kernel test robot during cpuhotplug rcutorture stress
   testing

 - Linked to v8: https://lore.kernel.org/lkml/20250905-isolcpus-io-queues-v8-0-885984c5daca@kernel.org/

Changes since v7:

 - Added commit 524f5eea4bbe ("lib/group_cpus: remove !SMP code")

 - Merged the new mapping logic directly into the existing function to
   avoid special casing

 - Refined the group_mask_cpus_evenly() implementation with the following
   updates:

   - Corrected the function name typo (changed group_masks_cpus_evenly to
     group_mask_cpus_evenly)

   - Updated the documentation comment to accurately reflect the function's
     behavior

   - Renamed the cpu_mask argument to mask for consistency

 - Added a new patch for aacraid to include the missing number of queues
   calculation

 - Restricted updates to only affect SCSI drivers that support
   PCI_IRQ_AFFINITY and do not utilise nvme-fabrics

 - Removed the __free cleanup attribute usage for cpumask_var_t allocations
   due to compatibility issues

 - Updated the documentation to explicitly highlight the limitations
   surrounding CPU offlining

 - Collected accumulated Reviewed-by and Acked-by tags

 - Linked to v7: https://patch.msgid.link/20250702-isolcpus-io-queues-v7-0-557aa7eacce4@kernel.org

Changes since v6:

 - Sent out the first part of the series independently:
   https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@kernel.org/

 - Added comprehensive kernel command-line documentation

 - Added validation logic to ensure the resulting CPU-to-queue mapping is
   fully operational

 - Rewrote the isolcpus mapping code to properly account for active
   hardware contexts (hctx)

 - Introduced blk_mq_map_hk_irq_queues, which utilizes the mask retrieved
   from irq_get_affinity()

 - Refactored blk_mq_map_hk_queues to require the caller to explicitly test
   for HK_TYPE_MANAGED_IRQ

 - Linked to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org

Changes since v5:

 - Reintroduced the io_queue type for the isolcpus kernel parameter

 - Prevented the offlining of a housekeeping CPU if an isolated CPU is
   still present, upgrading this behavior from a simple warning to a hard
   restriction

 - Linked to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org

Changes since v4:

 - Rebased the series onto the latest for-6.14/block branch.

 - Updated the documentation regarding the managed_irq parameters

 - Reworded the commit message for "blk-mq: issue warning when offlining
   hctx with online isolcpus" for better clarity

 - Split the input and output parameters in the patch "lib/group_cpus: let
   group_cpu_evenly return number of groups"

 - Dropped the patch "sched/isolation: document HK_TYPE housekeeping
   option"

 - Linked to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org

Changes since v3:

 - Added the patch "blk-mq: issue warning when offlining hctx with online
   isolcpus"

 - Fixed the check in group_cpus_evenly(); the condition now properly uses
   housekeeping_enabled() instead of cpumask_weight(), as the latter always
   returns a valid mask

 - Dropped the Fixes: tag from "lib/group_cpus.c: honor housekeeping config
   when grouping CPUs"

 - Fixed an overlong line warning in the patch "scsi: use block layer
   helpers to calculate num of queues"

 - Dropped the patch "sched/isolation: Add io_queue housekeeping option" in
   favor of simply documenting the housekeeping hk_type enum

 - Added the patch "lib/group_cpus: let group_cpu_evenly return number of
   groups"

 - Collected accumulated Reviewed-by and Acked-by tags

 - Split the patchset by moving foundational changes into a separate
   preparation series:
   https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/

 - Linked to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de

Changes since v2:

 - Integrated patches from Ming Lei
   (https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/):
   "virtio: add APIs for retrieving vq affinity" and "blk-mq: introduce
   blk_mq_dev_map_queues"

 - Replaced all instances of blk_mq_pci_map_queues and
   blk_mq_virtio_map_queues with the new unified blk_mq_dev_map_queues

 - Updated and expanded the helper functions used for calculating the
   number of queues

 - Added the CPU-to-hctx mapping function specifically to support the
   isolcpus=io_queue parameter

 - Documented the hk_type enum and the newly introduced isolcpus=io_queue
   parameter

 - Added the patch "scsi: pm8001: do not overwrite PCI queue mapping"

 - Linked to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes since v1:

 - Updated the feature documentation for clarity and completeness

 - Split the blk/nvme-pci patch into smaller, logical commits

 - Dropped the HK_TYPE_IO_QUEUE macro in favor of reusing
   HK_TYPE_MANAGED_IRQ

 - Linked to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de


Aaron Tomlin (1):
  genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs

Daniel Wagner (7):
  scsi: aacraid: use block layer helpers to calculate num of queues
  lib/group_cpus: remove dead !SMP code
  lib/group_cpus: Add group_mask_cpus_evenly()
  isolation: Introduce io_queue isolcpus type
  blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  docs: add io_queue flag to isolcpus

 .../admin-guide/kernel-parameters.txt         |  26 +-
 block/blk-mq-cpumap.c                         | 238 ++++++++++++++++--
 block/blk-mq.c                                |  55 ++++
 drivers/scsi/aacraid/comminit.c               |   3 +-
 include/linux/group_cpus.h                    |   3 +
 include/linux/sched/isolation.h               |   1 +
 kernel/irq/affinity.c                         |  31 ++-
 kernel/sched/isolation.c                      |   7 +
 lib/group_cpus.c                              | 112 ++++++++-
 9 files changed, 436 insertions(+), 40 deletions(-)


base-commit: 8bc67e4db64aa72732c474b44ea8622062c903f0
-- 
2.51.0



* [PATCH v15 1/8] scsi: aacraid: use block layer helpers to calculate num of queues
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 2/8] lib/group_cpus: remove dead !SMP code Aaron Tomlin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)

From: Daniel Wagner <wagi@kernel.org>

The calculation of the upper limit for queues does not depend solely on
the number of online CPUs; for example, the isolcpus kernel
command-line option must also be considered.

To account for this, the block layer provides a helper function to
retrieve the maximum number of queues. Use it to set an appropriate
upper queue number limit.

This patch brings aacraid in line with the API migration initiated for
other SCSI drivers in commit 94970cfb5f10 ("scsi: use block layer
helpers to calculate num of queues").

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
[atomlin: Drop "Fixes:" tag; indicate alignment with other SCSI drivers]
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 drivers/scsi/aacraid/comminit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
index 9bd3f5b868bc..ec165b57182d 100644
--- a/drivers/scsi/aacraid/comminit.c
+++ b/drivers/scsi/aacraid/comminit.c
@@ -469,8 +469,7 @@ void aac_define_int_mode(struct aac_dev *dev)
 	}
 
 	/* Don't bother allocating more MSI-X vectors than cpus */
-	msi_count = min(dev->max_msix,
-		(unsigned int)num_online_cpus());
+	msi_count = blk_mq_num_online_queues(dev->max_msix);
 
 	dev->max_msix = msi_count;
 
-- 
2.51.0



* [PATCH v15 2/8] lib/group_cpus: remove dead !SMP code
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 1/8] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 3/8] lib/group_cpus: Add group_mask_cpus_evenly() Aaron Tomlin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)

From: Daniel Wagner <wagi@kernel.org>

The core scheduler recently transitioned to compiling SMP data
structures unconditionally to reduce code complexity - see commit
cac5cefbade9 ("sched/smp: Make SMP unconditional").

In alignment with this philosophy of reducing dual-path maintenance,
this patch removes the #ifdef CONFIG_SMP guards and the dedicated !SMP
fallback logic here.

While the !SMP path provided a slightly simpler execution flow for
uniprocessor kernels (avoiding SMP-specific overhead), maintaining these
separate code paths adds unnecessary complexity and testing burden.
Removing these guards simplifies the codebase by standardizing entirely
on the SMP logic, which safely resolves to single-CPU operations on UP
configurations.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
[atomlin: Updated commit message to clarify !SMP removal context]
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 lib/group_cpus.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index e6e18d7a49bb..b8d54398f88a 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -9,8 +9,6 @@
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
 
-#ifdef CONFIG_SMP
-
 static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
 {
@@ -564,22 +562,4 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	*nummasks = min(nr_present + nr_others, numgrps);
 	return masks;
 }
-#else /* CONFIG_SMP */
-struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
-{
-	struct cpumask *masks;
-
-	if (numgrps == 0)
-		return NULL;
-
-	masks = kzalloc_objs(*masks, numgrps);
-	if (!masks)
-		return NULL;
-
-	/* assign all CPUs(cpu 0) to the 1st group only */
-	cpumask_copy(&masks[0], cpu_possible_mask);
-	*nummasks = 1;
-	return masks;
-}
-#endif /* CONFIG_SMP */
 EXPORT_SYMBOL_GPL(group_cpus_evenly);
-- 
2.51.0



* [PATCH v15 3/8] lib/group_cpus: Add group_mask_cpus_evenly()
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 1/8] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 2/8] lib/group_cpus: remove dead !SMP code Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 4/8] isolation: Introduce io_queue isolcpus type Aaron Tomlin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)

From: Daniel Wagner <wagi@kernel.org>

This commit introduces group_mask_cpus_evenly(), which allows callers to
distribute a specific CPU mask evenly across groups. It serves as a bounded
version of group_cpus_evenly().

While group_cpus_evenly() operates on the global cpu_possible_mask,
group_mask_cpus_evenly() confines the distribution strictly within the
boundaries of the caller-provided mask. It preserves the kernel's native
two-stage spreading logic, first prioritising CPUs that are physically
present (cpu_present_mask) to prevent I/O starvation, and then distributing
any remaining vectors to non-present CPUs to maintain hotplug safety.
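
The two-stage idea can be sketched in userspace as follows. This is a
deliberately simplified model: it uses plain round-robin instead of the
NUMA-aware spreading done by __group_cpus_evenly(), represents cpumasks as
64-bit integers, and the names spread() and group_mask_two_stage() are
invented for the example.

```c
#include <stdint.h>

#define MAX_CPUS 64

/*
 * Spread the CPUs set in @cpus across @masks, starting at group @start and
 * wrapping at @numgrps. Returns the number of groups that received a CPU.
 * Simplified: round-robin only, no NUMA or sibling awareness.
 */
static unsigned int spread(uint64_t cpus, unsigned int start,
			   unsigned int numgrps, uint64_t *masks)
{
	unsigned int grp = start, used = 0;

	for (unsigned int cpu = 0; cpu < MAX_CPUS; cpu++) {
		if (!(cpus & ((uint64_t)1 << cpu)))
			continue;
		masks[grp] |= (uint64_t)1 << cpu;
		if (used < numgrps)
			used++;
		if (++grp == numgrps)
			grp = 0;
	}
	return used;
}

/*
 * Stage 1 groups the present CPUs of @mask, stage 2 the non-present ones.
 * Returns the number of initialised groups; 0 plays the role of the NULL
 * return for an empty result.
 */
static unsigned int group_mask_two_stage(unsigned int numgrps, uint64_t mask,
					 uint64_t present, uint64_t *masks)
{
	unsigned int nr_present, nr_others, start, total;

	if (numgrps == 0)
		return 0;
	for (unsigned int i = 0; i < numgrps; i++)
		masks[i] = 0;

	nr_present = spread(mask & present, 0, numgrps, masks);

	/* Continue after the present groups, or wrap if none are left. */
	start = nr_present >= numgrps ? 0 : nr_present;
	nr_others = spread(mask & ~present, start, numgrps, masks);

	total = nr_present + nr_others;
	return total < numgrps ? total : numgrps;
}
```

CPUs outside the caller-provided mask never appear in any group, which is the
property that distinguishes the bounded variant from group_cpus_evenly().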

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
[atomlin:
    - Added check for numgrps == 0
    - Updated commit message to resolve typo
    - Removed unused <linux/sched/isolation.h>
    - Fix TOCTOU race by caching the provided mask
    - Removed ineffective data_race() annotations around cpumask pointers
    - Implemented two-stage grouping logic to prioritise physically
      present CPUs, mirroring group_cpus_evenly()
    - Fix division-by-zero bug by ensuring group_mask_cpus_evenly()
      returns NULL instead of an empty array when evaluated against an
      empty mask]
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 include/linux/group_cpus.h |   3 +
 lib/group_cpus.c           | 110 +++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index 9d4e5ab6c314..defab4123a82 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -10,5 +10,8 @@
 #include <linux/cpu.h>
 
 struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks);
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *mask,
+				       unsigned int *nummasks);
 
 #endif
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index b8d54398f88a..75bd082e00bf 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -563,3 +563,113 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	return masks;
 }
 EXPORT_SYMBOL_GPL(group_cpus_evenly);
+
+/**
+ * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of cpumasks to create
+ * @mask: CPUs to consider for the grouping
+ * @nummasks: number of initialized cpumasks
+ *
+ * Return: cpumask array if successful, NULL otherwise. Only the CPUs
+ * marked in the mask will be considered for the grouping. And each
+ * element includes CPUs assigned to this group. nummasks contains the
+ * number of initialized masks which can be less than numgrps.
+ *
+ * Try to put close CPUs from viewpoint of CPU and NUMA locality into
+ * the same group.
+ *
+ * We guarantee in the resulting grouping that all CPUs specified in the
+ * provided mask are covered, and no same CPU is assigned to multiple
+ * groups.
+ */
+struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+				       const struct cpumask *mask,
+				       unsigned int *nummasks)
+{
+	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk, local_mask, npresmsk;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (numgrps == 0)
+		return NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	if (!zalloc_cpumask_var(&local_mask, GFP_KERNEL))
+		goto fail_nmsk;
+
+	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+		goto fail_local_mask;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_npresmsk;
+
+	masks = kzalloc_objs(*masks, numgrps);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	build_node_to_cpumask(node_to_cpumask);
+
+	/*
+	 * Create a stable snapshot of the mask. The grouping algorithm
+	 * requires the CPU count to remain constant across its multiple
+	 * passes. This prevents allocation failures if the caller passes a
+	 * dynamic mask (e.g., cpu_online_mask) that changes concurrently.
+	 */
+	cpumask_copy(local_mask, mask);
+
+	/*
+	 * Grouping present CPUs first. We intersect the provided mask with
+	 * cpu_present_mask to ensure that we prioritise physically
+	 * available CPUs for the initial distribution.
+	 */
+	cpumask_and(npresmsk, local_mask, cpu_present_mask);
+	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+				  npresmsk, nmsk, masks);
+	if (ret < 0)
+		goto fail_node_to_cpumask;
+	nr_present = ret;
+
+	/*
+	 * Allocate non-present CPUs starting from the next group to be
+	 * handled. If the grouping of present CPUs already exhausted the
+	 * group space, assign the non-present CPUs to the already
+	 * allocated out groups.
+	 */
+	if (nr_present >= numgrps)
+		curgrp = 0;
+	else
+		curgrp = nr_present;
+	cpumask_andnot(npresmsk, local_mask, npresmsk);
+	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
+				  npresmsk, nmsk, masks);
+	if (ret >= 0)
+		nr_others = ret;
+
+fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+fail_npresmsk:
+	free_cpumask_var(npresmsk);
+
+fail_local_mask:
+	free_cpumask_var(local_mask);
+
+fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	*nummasks = min(nr_present + nr_others, numgrps);
+	if (*nummasks == 0) {
+		kfree(masks);
+		return NULL;
+	}
+	return masks;
+}
+EXPORT_SYMBOL_GPL(group_mask_cpus_evenly);
-- 
2.51.0



* [PATCH v15 4/8] isolation: Introduce io_queue isolcpus type
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
                   ` (2 preceding siblings ...)
  2026-05-21 23:29 ` [PATCH v15 3/8] lib/group_cpus: Add group_mask_cpus_evenly() Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 5/8] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)

From: Daniel Wagner <wagi@kernel.org>

Multiqueue drivers spread I/O queues across all CPUs for optimal
performance. However, these drivers are not aware of CPU isolation
requirements and will distribute queues without considering the isolcpus
configuration.

Introduce a new isolcpus mask that allows users to define which CPUs
should have I/O queues assigned. This is similar to managed_irq, but
intended for drivers that do not use the managed IRQ infrastructure.
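
For readers unfamiliar with how isolcpus= flags are consumed, here is a
minimal userspace sketch of prefix-based flag parsing in the same spirit as
the strncmp() checks in housekeeping_isolcpus_setup(). The FLAG_* values and
the function parse_isolcpus_flags() are illustrative assumptions; unlike the
kernel, the sketch does not skip or validate unknown sub-parameters.

```c
#include <string.h>

/* Illustrative flag bits; not the kernel's HK_FLAG_* values. */
#define FLAG_DOMAIN		(1u << 0)
#define FLAG_MANAGED_IRQ	(1u << 1)
#define FLAG_IO_QUEUE		(1u << 2)

/*
 * Consume leading "flag," prefixes from @str and accumulate them in
 * @flags. Returns a pointer to the remainder, i.e. the CPU list.
 */
static const char *parse_isolcpus_flags(const char *str, unsigned int *flags)
{
	static const struct {
		const char *prefix;
		unsigned int flag;
	} known[] = {
		{ "domain,",      FLAG_DOMAIN },
		{ "managed_irq,", FLAG_MANAGED_IRQ },
		{ "io_queue,",    FLAG_IO_QUEUE },
	};
	const size_t nr_known = sizeof(known) / sizeof(known[0]);
	size_t i;

	*flags = 0;
	do {
		for (i = 0; i < nr_known; i++) {
			size_t len = strlen(known[i].prefix);

			if (!strncmp(str, known[i].prefix, len)) {
				*flags |= known[i].flag;
				str += len;
				break;
			}
		}
	} while (i < nr_known);	/* stop once no known prefix matched */

	return str;
}
```

Given "io_queue,2-3,6-7,12-13", the sketch sets only FLAG_IO_QUEUE and
returns a pointer to "2-3,6-7,12-13".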

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 include/linux/sched/isolation.h | 1 +
 kernel/sched/isolation.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cf0fd03dd7a2..30cb9a44365e 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -18,6 +18,7 @@ enum hk_type {
 	HK_TYPE_MANAGED_IRQ,
 	/* Inverse of boot-time nohz_full= or isolcpus=nohz arguments */
 	HK_TYPE_KERNEL_NOISE,
+	HK_TYPE_IO_QUEUE,
 	HK_TYPE_MAX,
 
 	/*
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ef152d401fe2..3406e3024fd4 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -16,6 +16,7 @@ enum hk_flags {
 	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
 	HK_FLAG_KERNEL_NOISE	= BIT(HK_TYPE_KERNEL_NOISE),
+	HK_FLAG_IO_QUEUE	= BIT(HK_TYPE_IO_QUEUE),
 };
 
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
@@ -340,6 +341,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "io_queue,", 9)) {
+			str += 9;
+			flags |= HK_FLAG_IO_QUEUE;
+			continue;
+		}
+
 		/*
 		 * Skip unknown sub-parameter and validate that it is not
 		 * containing an invalid character.
-- 
2.51.0



* [PATCH v15 5/8] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
                   ` (3 preceding siblings ...)
  2026-05-21 23:29 ` [PATCH v15 4/8] isolation: Introduce io_queue isolcpus type Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 6/8] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Aaron Tomlin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, tom.leiming, steve, sean, chjohnst, neelx, mproche,
	nick.lange, marco.crivellari, rishil1999, linux-block,
	linux-kernel

From: Daniel Wagner <wagi@kernel.org>

Extend the generic CPU to hardware queue (hctx) mapping code so that it
maps housekeeping CPUs and isolated CPUs to the hardware queues evenly.

An hctx is only operational when at least one online housekeeping CPU
is assigned to it (aka active_hctx). Thus, check the final mapping to
ensure that no hctx has only offline housekeeping CPUs and online
isolated CPUs.

Example mapping result:

  16 online CPUs

  isolcpus=io_queue,2-3,6-7,12-13

Queue mapping:
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7
        hctx4: default 8 12
        hctx5: default 9 13
        hctx6: default 10
        hctx7: default 11
        hctx8: default 14
        hctx9: default 15

IRQ mapping:
        irq 42 affinity 0 effective 0  nvme0q0
        irq 43 affinity 0 effective 0  nvme0q1
        irq 44 affinity 1 effective 1  nvme0q2
        irq 45 affinity 4 effective 4  nvme0q3
        irq 46 affinity 5 effective 5  nvme0q4
        irq 47 affinity 8 effective 8  nvme0q5
        irq 48 affinity 9 effective 9  nvme0q6
        irq 49 affinity 10 effective 10  nvme0q7
        irq 50 affinity 11 effective 11  nvme0q8
        irq 51 affinity 14 effective 14  nvme0q9
        irq 52 affinity 15 effective 15  nvme0q10
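
The queue mapping in the example above can be reproduced with a small
user-space sketch. It assumes that all 16 CPUs are online and that one
queue is requested per housekeeping CPU, and it replaces the
topology-aware group_mask_cpus_evenly() with a plain ascending
assignment, so it only illustrates the two phases (housekeeping CPUs
first, isolated CPUs spread round-robin afterwards). The names
SKETCH_NR_CPUS and sketch_map_queues() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 16

/*
 * Phase 1: give every housekeeping CPU its own queue, in ascending
 * CPU order. Phase 2: spread the isolated CPUs round-robin over those
 * queues. Returns the number of queues in use.
 */
static unsigned int sketch_map_queues(const bool isolated[SKETCH_NR_CPUS],
				      unsigned int map[SKETCH_NR_CPUS])
{
	unsigned int nr_queues = 0, next = 0, cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		if (!isolated[cpu])
			map[cpu] = nr_queues++;

	/* At least one housekeeping CPU is required to serve I/O. */
	assert(nr_queues > 0);

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (!isolated[cpu])
			continue;
		map[cpu] = next;
		next = (next + 1) % nr_queues;
	}

	return nr_queues;
}
```

With CPUs 2-3, 6-7 and 12-13 marked as isolated, the sketch uses ten
queues and places CPU 2 on the queue of CPU 0, CPU 3 on the queue of
CPU 1, and so on, which matches the hctx0 to hctx9 listing above.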

A corner case is when the number of online CPUs and present CPUs
differ and the driver asks for fewer queues than online CPUs, e.g.

  8 online CPUs, 16 possible CPUs

  isolcpus=io_queue,2-3,6-7,12-13
  virtio_blk.num_request_queues=2

Queue mapping:
        hctx0: default 0 1 2 3 4 5 6 7 8 12 13
        hctx1: default 9 10 11 14 15

IRQ mapping:
        irq 27 affinity 0 effective 0 virtio0-config
        irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
        irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1

Note that for the normal/default configuration (!isolcpus) the mapping
will change on systems with non-hyperthreading CPUs. The main
assignment loop now relies entirely on group_mask_cpus_evenly() to do
the right thing. The old code would distribute the CPUs linearly over
the hardware contexts:

queue mapping for /dev/nvme0n1
        hctx0: default 0 8
        hctx1: default 1 9
        hctx2: default 2 10
        hctx3: default 3 11
        hctx4: default 4 12
        hctx5: default 5 13
        hctx6: default 6 14
        hctx7: default 7 15

The new code assigns each hardware context the map generated by the
group_mask_cpus_evenly() function:

queue mapping for /dev/nvme0n1
        hctx0: default 0 1
        hctx1: default 2 3
        hctx2: default 4 5
        hctx3: default 6 7
        hctx4: default 8 9
        hctx5: default 10 11
        hctx6: default 12 13
        hctx7: default 14 15

In case of hyperthreading CPUs, the resulting map stays the same.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
[atomlin:
    - Updated blk_mq_validate() to use test_bit() for the new bitmap
    - Replaced __free cleanups with traditional goto unwinding to align
      with subsystem styling
    - Updated blk_mq_map_fallback() to use qmap->queue_offset ensuring
      secondary maps do not incorrectly route to the primary default map
    - Added a bitmap_empty() check to prevent out-of-bounds CPU routing
      when all mapped CPUs are offline
    - Migrated active_hctx to a dynamically sized bitmap to fix an
      out-of-bounds write when hardware queues exceed the system CPU
      count
    - Fixed absolute vs. relative hardware queue index mix-up in
      blk_mq_map_queues() and validation checks
    - Fixed typographical errors
    - Reduced stack frame size of blk_mq_num_queues()
    - Resolved a TOCTOU race against CPU hotplug events by snapshotting
      cpu_online_mask to ensure mapping and validation phases agree
    - Corrected a loop overwrite bug in blk_mq_map_queues() by iterating
      directly over masks to prevent orphaned queues from being activated
    - Restored topology-aware multi-queue fallback in
      blk_mq_map_hw_queues() by correctly routing missing IRQ affinity
      masks to the map_software path instead of the naive fallback
    - Fixed a silent validation bypass in blk_mq_map_hw_queues() caused by
      overlapping IRQ affinity masks by evaluating the active_hctx bitmap
      in a secondary pass
    - Hardened isolation logic in blk_mq_map_hw_queues() to require online
      housekeeping CPUs before marking a hardware queue as active
    - Enforce safe fallback of 1 when the intersection evaluates to 0]
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-cpumap.c | 238 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 220 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 705da074ad6c..efb02655f59e 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -22,8 +22,15 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
 {
 	unsigned int num;
 
-	num = cpumask_weight(mask);
-	return min_not_zero(num, max_queues);
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		num = cpumask_weight_and(mask, housekeeping_cpumask(HK_TYPE_IO_QUEUE));
+	else
+		num = cpumask_weight(mask);
+	/*
+	 * Ensure that a count of zero does not inadvertently result in
+	 * allocating the maximum number of queues.
+	 */
+	return min_not_zero(num ?: 1U, max_queues);
 }
 
 /**
@@ -33,7 +40,8 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask,
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of possible CPUs.
+ * device based on the number of possible CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
 {
@@ -48,7 +56,8 @@ EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
  *		ignored.
  *
  * Calculates the number of queues to be used for a multiqueue
- * device based on the number of online CPUs.
+ * device based on the number of online CPUs. This helper
+ * takes isolcpus settings into account.
  */
 unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 {
@@ -56,23 +65,139 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
 
+static bool blk_mq_validate(struct blk_mq_queue_map *qmap,
+			    const unsigned long *active_hctx,
+			    const struct cpumask *online_mask)
+{
+	/*
+	 * Verify if the mapping is usable when housekeeping
+	 * configuration is enabled
+	 */
+	for (int queue = 0; queue < qmap->nr_queues; queue++) {
+		int cpu;
+
+		if (test_bit(queue, active_hctx)) {
+			/*
+			 * This hctx has at least one online CPU thus it
+			 * is able to serve any assigned isolated CPU.
+			 */
+			continue;
+		}
+
+		/*
+		 * There is no housekeeping online CPU for this hctx, all
+		 * good as long as all non-housekeeping CPUs are also
+		 * offline.
+		 */
+		for_each_cpu(cpu, online_mask) {
+			if (qmap->mq_map[cpu] != qmap->queue_offset + queue)
+				continue;
+
+			pr_warn("Unable to create a usable CPU-to-queue mapping with the given constraints\n");
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static void blk_mq_map_fallback(struct blk_mq_queue_map *qmap)
+{
+	unsigned int cpu;
+
+	/*
+	 * Map all CPUs to the first hctx of this specific map to ensure
+	 * at least one online CPU is serving it, respecting the map's
+	 * boundaries so secondary maps do not route into the default map.
+	 */
+	for_each_possible_cpu(cpu)
+		qmap->mq_map[cpu] = qmap->queue_offset;
+}
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
-	const struct cpumask *masks;
+	struct cpumask *masks;
+	const struct cpumask *constraint;
 	unsigned int queue, cpu, nr_masks;
+	unsigned long *active_hctx;
+	cpumask_var_t online_mask;
 
-	masks = group_cpus_evenly(qmap->nr_queues, &nr_masks);
-	if (!masks) {
-		for_each_possible_cpu(cpu)
-			qmap->mq_map[cpu] = qmap->queue_offset;
-		return;
-	}
+	active_hctx = bitmap_zalloc(qmap->nr_queues, GFP_KERNEL);
+	if (!active_hctx)
+		goto fallback;
 
-	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		for_each_cpu(cpu, &masks[queue % nr_masks])
+	if (!alloc_cpumask_var(&online_mask, GFP_KERNEL))
+		goto free_fallback_hctx;
+
+	/*
+	 * Snapshot online CPUs to prevent TOCTOU races between the
+	 * mapping phase and the validation phase.
+	 */
+	cpumask_copy(online_mask, cpu_online_mask);
+
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	else
+		constraint = cpu_possible_mask;
+
+	/* Map CPUs to the hardware contexts (hctx) */
+	masks = group_mask_cpus_evenly(qmap->nr_queues, constraint, &nr_masks);
+	if (!masks)
+		goto free_fallback;
+
+	/*
+	 * Iterate directly over the generated CPU masks.
+	 * Calculate the final, highest hardware queue index that maps to this
+	 * mask. This skips all intermediate overwrites and safely evaluates
+	 * active_hctx only for queues that survive the mapping.
+	 */
+	for (unsigned int idx = 0; idx < nr_masks; idx++) {
+		bool active = false;
+		queue = qmap->nr_queues - 1 -
+			((qmap->nr_queues - 1 - idx) % nr_masks);
+
+		for_each_cpu(cpu, &masks[idx]) {
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+
+			if (!active && cpumask_test_cpu(cpu, online_mask)) {
+				__set_bit(queue, active_hctx);
+				active = true;
+			}
+		}
+	}
+
+	/*
+	 * If all CPUs in the generated masks are offline, the active_hctx
+	 * bitmap will be empty. Attempting to route unassigned CPUs to an
+	 * empty bitmap will map them out-of-bounds. Fall back instead.
+	 */
+	if (bitmap_empty(active_hctx, qmap->nr_queues))
+		goto free_fallback;
+
+	/* Map any unassigned CPU evenly to the hardware contexts (hctx) */
+	queue = find_first_bit(active_hctx, qmap->nr_queues);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, constraint) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = find_next_bit_wrap(active_hctx, qmap->nr_queues, queue + 1);
 	}
+
+	if (!blk_mq_validate(qmap, active_hctx, online_mask))
+		goto free_fallback;
+
 	kfree(masks);
+	free_cpumask_var(online_mask);
+	bitmap_free(active_hctx);
+
+	return;
+
+free_fallback:
+	kfree(masks);
+	free_cpumask_var(online_mask);
+free_fallback_hctx:
+	bitmap_free(active_hctx);
+
+fallback:
+	blk_mq_map_fallback(qmap);
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
 
@@ -109,24 +234,101 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap,
 			  struct device *dev, unsigned int offset)
 
 {
-	const struct cpumask *mask;
+	cpumask_var_t mask, online_mask;
+	const struct cpumask *constraint;
+	unsigned long *active_hctx;
 	unsigned int queue, cpu;
 
 	if (!dev->bus->irq_get_affinity)
+		goto map_software;
+
+	active_hctx = bitmap_zalloc(qmap->nr_queues, GFP_KERNEL);
+	if (!active_hctx)
+		goto fallback;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+		bitmap_free(active_hctx);
 		goto fallback;
+	}
+
+	if (!alloc_cpumask_var(&online_mask, GFP_KERNEL))
+		goto free_fallback_mask;
+
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	else
+		constraint = cpu_possible_mask;
 
+	/*
+	 * Snapshot online CPUs to prevent TOCTOU races between the
+	 * mapping phase and the validation phase.
+	 */
+	cpumask_copy(online_mask, cpu_online_mask);
+
+	/* Map CPUs to the hardware contexts (hctx) */
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
-		mask = dev->bus->irq_get_affinity(dev, queue + offset);
-		if (!mask)
-			goto fallback;
+		const struct cpumask *affinity_mask;
+
+		affinity_mask = dev->bus->irq_get_affinity(dev, offset + queue);
+		if (!affinity_mask)
+			goto free_map_software;
 
-		for_each_cpu(cpu, mask)
+		for_each_cpu(cpu, affinity_mask) {
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+			cpumask_set_cpu(cpu, mask);
+		}
+	}
+
+	/*
+	 * Evaluate active_hctx after mapping to handle overlapping masks.
+	 * This ensures queues that were overwritten do not falsely pass validation.
+	 */
+	for_each_cpu(cpu, mask) {
+		if (cpumask_test_cpu(cpu, online_mask) &&
+			cpumask_test_cpu(cpu, constraint)) {
+			queue = qmap->mq_map[cpu] - qmap->queue_offset;
+			__set_bit(queue, active_hctx);
+		}
+	}
+
+	/*
+	 * If all CPUs assigned to this map are offline, the bitmap will
+	 * be empty. Fall back instead of routing out of bounds.
+	 */
+	if (bitmap_empty(active_hctx, qmap->nr_queues))
+		goto free_fallback;
+
+	/* Map any unassigned CPU evenly to the hardware contexts (hctx) */
+	queue = find_first_bit(active_hctx, qmap->nr_queues);
+	for_each_cpu_andnot(cpu, cpu_possible_mask, mask) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = find_next_bit_wrap(active_hctx, qmap->nr_queues, queue + 1);
 	}
 
+	if (!blk_mq_validate(qmap, active_hctx, online_mask))
+		goto free_fallback;
+
+	bitmap_free(active_hctx);
+	free_cpumask_var(mask);
+	free_cpumask_var(online_mask);
+
 	return;
 
+free_fallback:
+	free_cpumask_var(online_mask);
+free_fallback_mask:
+	bitmap_free(active_hctx);
+	free_cpumask_var(mask);
+
 fallback:
+	blk_mq_map_fallback(qmap);
+	return;
+
+free_map_software:
+	free_cpumask_var(online_mask);
+	free_cpumask_var(mask);
+	bitmap_free(active_hctx);
+map_software:
 	blk_mq_map_queues(qmap);
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_hw_queues);
-- 
2.51.0



* [PATCH v15 6/8] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
                   ` (4 preceding siblings ...)
  2026-05-21 23:29 ` [PATCH v15 5/8] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 7/8] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 8/8] docs: add io_queue flag to isolcpus Aaron Tomlin
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, tom.leiming, steve, sean, chjohnst, neelx, mproche,
	nick.lange, marco.crivellari, rishil1999, linux-block,
	linux-kernel

From: Daniel Wagner <wagi@kernel.org>

When isolcpus=io_queue is enabled and the last housekeeping CPU
for a given hctx goes offline, no CPU would be left to handle I/O.
To prevent I/O stalls, disallow offlining housekeeping CPUs that are
still serving isolated CPUs.
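
The rule can be sketched in user space as follows. This is an
illustration under simplifying assumptions, not the kernel
implementation: struct sketch_state and sketch_can_offline() are
hypothetical, and the per-CPU queue assignment is a flat array rather
than a lookup through blk_mq_map_queue_type().

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 8

/* Hypothetical flattened state; the kernel uses cpumasks and mq_map. */
struct sketch_state {
	bool online[SKETCH_NR_CPUS];
	bool housekeeping[SKETCH_NR_CPUS];
	unsigned int queue_of[SKETCH_NR_CPUS];
};

/*
 * Return true if this_cpu may go offline without leaving an online
 * isolated CPU on hctx with no online housekeeping CPU to serve it.
 */
static bool sketch_can_offline(const struct sketch_state *s,
			       unsigned int hctx, unsigned int this_cpu)
{
	bool isolated_left = false;
	unsigned int cpu;

	assert(this_cpu < SKETCH_NR_CPUS);

	/* An isolated CPU, or a CPU mapped to another queue, is harmless. */
	if (!s->housekeeping[this_cpu] || s->queue_of[this_cpu] != hctx)
		return true;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (cpu == this_cpu || !s->online[cpu] ||
		    s->queue_of[cpu] != hctx)
			continue;

		/* Another housekeeping CPU still serves this queue. */
		if (s->housekeeping[cpu])
			return true;

		isolated_left = true;
	}

	return !isolated_left;
}
```

In this sketch, offlining the only housekeeping CPU of a queue is
refused for as long as an online isolated CPU is still mapped to that
queue, and is allowed again once that isolated CPU has gone offline.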

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
[atomlin:
    - Removed duplicate paragraph from commit message
    - Allow offlining of non-housekeeping CPUs
    - Fix logic flaw that prematurely rejected valid offline requests
    - Iterated over cpu_online_mask and manually reverse-mapped CPUs to
      correctly detect isolated CPUs, as blk_mq_map_swqueue()
      intentionally prunes them from hctx->cpumask
    - Drop hctx->queue->disk->disk_name from warning to avoid UAF bug
    - Ensure isolation constraints are only enforced for CPUs actively
      mapped to the evaluated hardware queue
    - Correct pr_warn format specifier]
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..7c3f4d6546f0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3739,6 +3739,56 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	return data.has_rq;
 }
 
+static bool blk_mq_hctx_can_offline_hk_cpu(struct blk_mq_hw_ctx *hctx,
+					   unsigned int this_cpu)
+{
+	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	int cpu, fallback_isolated_cpu = -1;
+
+	/*
+	 * If the CPU being offlined is not a housekeeping CPU,
+	 * offlining it will not strand isolated CPUs. Allow it.
+	 */
+	if (!cpumask_test_cpu(this_cpu, hk_mask))
+		return true;
+	/*
+	 * If this CPU is not mapped to this specific hardware context,
+	 * offlining it will not affect the context's I/O routing. Allow it.
+	 */
+	if (blk_mq_map_queue_type(hctx->queue, hctx->type, this_cpu) != hctx)
+		return true;
+	/*
+	 * Iterate over all online CPUs and manually check their mapping.
+	 * We cannot use hctx->cpumask here because blk_mq_map_swqueue()
+	 * intentionally strips isolated CPUs from it to prevent kworker
+	 * routing.
+	 */
+	for_each_online_cpu(cpu) {
+		struct blk_mq_hw_ctx *h;
+
+		if (cpu == this_cpu)
+			continue;
+
+		h = blk_mq_map_queue_type(hctx->queue, hctx->type, cpu);
+		if (h != hctx)
+			continue;
+
+		if (cpumask_test_cpu(cpu, hk_mask))
+			return true;
+
+		if (fallback_isolated_cpu == -1)
+			fallback_isolated_cpu = cpu;
+	}
+
+	if (fallback_isolated_cpu != -1) {
+		pr_warn("blk-mq: trying to offline hctx%u but online isolated CPU %d is still mapped to it\n",
+			hctx->queue_num, fallback_isolated_cpu);
+		return false;
+	}
+
+	return true;
+}
+
 static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int this_cpu)
 {
@@ -3771,6 +3821,11 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 			struct blk_mq_hw_ctx, cpuhp_online);
 	int ret = 0;
 
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
+		if (!blk_mq_hctx_can_offline_hk_cpu(hctx, cpu))
+			return -EINVAL;
+	}
+
 	if (!hctx->nr_ctx || blk_mq_hctx_has_online_cpu(hctx, cpu))
 		return 0;
 
-- 
2.51.0



* [PATCH v15 7/8] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
                   ` (5 preceding siblings ...)
  2026-05-21 23:29 ` [PATCH v15 6/8] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  2026-05-21 23:29 ` [PATCH v15 8/8] docs: add io_queue flag to isolcpus Aaron Tomlin
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, tom.leiming, steve, sean, chjohnst, neelx, mproche,
	nick.lange, marco.crivellari, rishil1999, linux-block,
	linux-kernel

At present, the managed interrupt spreading algorithm distributes vectors
across all available CPUs within a given node or system. On systems
employing CPU isolation (e.g., "isolcpus=io_queue"), this behaviour
defeats the primary purpose of isolation by routing hardware interrupts
(such as NVMe completion queues) directly to isolated cores.

Update irq_create_affinity_masks() to respect the housekeeping CPU mask.
By passing the HK_TYPE_IO_QUEUE mask directly to the topological
distribution function (group_mask_cpus_evenly()), we ensure that managed
interrupts are kept strictly off isolated CPUs.

This patch additionally addresses the architectural constraints of
restricted vector distribution:

    1.  Vector Limits and Overrides: Updated irq_calc_affinity_vectors()
        to strictly bound the maximum number of allocated vectors to the
        weight of the housekeeping mask. This correctly overrides
        drivers providing a calc_sets() callback, preventing them from
        wasting memory on dead hardware queues that cannot be routed to
        isolated CPUs.

    2.  Multi-set Alignment and Leak Prevention: When isolation
        constraints result in fewer available masks than requested
        vectors for a given set, the remaining vector slots are padded
        with the housekeeping mask. This replaces the historical
        irq_default_affinity padding, ensuring excess managed queues do
        not leak interrupts onto isolated CPUs.

    3.  Minimum Vector Safety Net: To prevent fatal -ENOSPC device probe
        aborts on heavily isolated systems (where the housekeeping CPU
        count might be lower than a device's structural minimum), the
        final vector calculation is safeguarded to never drop below
        minvec. Queues will safely share the available housekeeping CPUs
        instead of failing the probe.

    4.  Zero Overhead: The housekeeping mask is conditionally assigned
        via a direct pointer, completely avoiding temporary mask
        allocations (e.g., alloc_cpumask_var) and bitwise operations
        when CPU isolation is disabled. This guarantees zero performance
        or memory overhead for standard configurations.
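
The vector arithmetic described in points 1 and 3 can be checked with a
small user-space sketch. It mirrors only the isolation-enabled branch
of irq_calc_affinity_vectors(); the helper names and the hk_cpus
parameter (standing in for the weight of the housekeeping mask) are
hypothetical.

```c
#include <assert.h>

static unsigned int sketch_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int sketch_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * resv is the number of reserved (pre plus post) vectors. The
 * spreadable vectors are capped at hk_cpus and at maxvec - resv, and
 * the result never drops below minvec.
 */
static unsigned int sketch_calc_vectors(unsigned int minvec,
					unsigned int maxvec,
					unsigned int resv,
					unsigned int hk_cpus)
{
	unsigned int spread;

	assert(minvec <= maxvec);

	/* Reserved vectors alone already exceed the minimum: give up. */
	if (resv > minvec)
		return 0;

	spread = sketch_min(hk_cpus, maxvec - resv);
	return sketch_max(minvec, resv + spread);
}
```

For instance, with one reserved vector, ten housekeeping CPUs and a
generous maxvec, the sketch yields eleven vectors. With only four
housekeeping CPUs and minvec set to eight it yields eight, so the
queues share the housekeeping CPUs rather than the probe failing.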

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 kernel/irq/affinity.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 78f2418a8925..dade92f8b4b3 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
 {
@@ -25,8 +26,10 @@ static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
 struct irq_affinity_desc *
 irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 {
-	unsigned int affvecs, curvec, usedvecs, i;
+	unsigned int affvecs, curvec, usedvecs, i, j;
 	struct irq_affinity_desc *masks = NULL;
+	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	bool hk_enabled = housekeeping_enabled(HK_TYPE_IO_QUEUE);
 
 	/*
 	 * Determine the number of vectors which need interrupt affinities
@@ -70,19 +73,29 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	 */
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int nr_masks, this_vecs = affd->set_size[i];
-		struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks);
+		struct cpumask *result;
+		const struct cpumask *mask;
 
+		if (hk_enabled)
+			mask = hk_mask;
+		else
+			mask = cpu_possible_mask;
+
+		result = group_mask_cpus_evenly(this_vecs, mask,
+						&nr_masks);
 		if (!result) {
 			kfree(masks);
 			return NULL;
 		}
-
-		for (int j = 0; j < nr_masks; j++)
+		for (j = 0; j < nr_masks; j++)
 			cpumask_copy(&masks[curvec + j].mask, &result[j]);
+		for (j = nr_masks; j < this_vecs; j++)
+			cpumask_copy(&masks[curvec + j].mask, mask);
+
 		kfree(result);
 
-		curvec += nr_masks;
-		usedvecs += nr_masks;
+		curvec += this_vecs;
+		usedvecs += this_vecs;
 	}
 
 	/* Fill out vectors at the end that don't need affinity */
@@ -115,10 +128,12 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
 	if (resv > minvec)
 		return 0;
 
-	if (affd->calc_sets)
+	if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		set_vecs = cpumask_weight(housekeeping_cpumask(HK_TYPE_IO_QUEUE));
+	else if (affd->calc_sets)
 		set_vecs = maxvec - resv;
 	else
 		set_vecs = cpumask_weight(cpu_possible_mask);
 
-	return resv + min(set_vecs, maxvec - resv);
+	return max(minvec, resv + min(set_vecs, maxvec - resv));
 }
-- 
2.51.0



* [PATCH v15 8/8] docs: add io_queue flag to isolcpus
  2026-05-21 23:29 [PATCH v15 0/8] blk: honor isolcpus configuration Aaron Tomlin
                   ` (6 preceding siblings ...)
  2026-05-21 23:29 ` [PATCH v15 7/8] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs Aaron Tomlin
@ 2026-05-21 23:29 ` Aaron Tomlin
  7 siblings, 0 replies; 9+ messages in thread
From: Aaron Tomlin @ 2026-05-21 23:29 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, mst
  Cc: atomlin, aacraid, James.Bottomley, martin.petersen, liyihang9,
	kashyap.desai, sumit.saxena, shivasharan.srikanteshwara,
	chandrakanth.patil, sathya.prakash, sreekanth.reddy,
	suganath-prabu.subramani, ranjan.kumar, jinpu.wang, tglx, mingo,
	peterz, juri.lelli, vincent.guittot, akpm, maz, ruanjinjie,
	bigeasy, yphbchou0911, wagi, frederic, longman, chenridong, hare,
	kch, ming.lei, tom.leiming, steve, sean, chjohnst, neelx, mproche,
	nick.lange, marco.crivellari, rishil1999, linux-block,
	linux-kernel

From: Daniel Wagner <wagi@kernel.org>

The io_queue flag informs multiqueue device drivers where to place
hardware queues. Document this new flag in the isolcpus
command-line argument description.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
[atomlin:
    - Refined io_queue kernel parameter documentation
    - Removed an inaccurate claim in the documentation stating
      that io_queue takes precedence over managed_irq]
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 .../admin-guide/kernel-parameters.txt         | 26 ++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb3ec..fb828bb60b9e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2792,7 +2792,6 @@ Kernel parameters
 			  "number of CPUs in system - 1".
 
 			managed_irq
-
 			  Isolate from being targeted by managed interrupts
 			  which have an interrupt mask containing isolated
 			  CPUs. The affinity of managed interrupts is
@@ -2815,6 +2814,31 @@ Kernel parameters
 			  housekeeping CPUs has no influence on those
 			  queues.
 
+			io_queue
+			  Applicable to managed IRQs only. Restrict
+			  multiqueue hardware queue allocation to online
+			  housekeeping CPUs. This guarantees that all
+			  managed hardware completion interrupts are routed
+			  exclusively to housekeeping cores, shielding
+			  isolated CPUs from I/O interruptions even if they
+			  initiated the request.
+
+			  Note: Using io_queue restricts the number of
+			  allocated hardware queues to match the number of
+			  housekeeping CPUs. This prevents MSI-X vector
+			  exhaustion and forces isolated CPUs to share
+			  submission queues.
+
+			  Note: Offlining housekeeping CPUs which serve
+			  isolated CPUs will fail. The isolated CPUs must
+			  be offlined before offlining the housekeeping
+			  CPUs.
+
+			  Note: When I/O is submitted by an application on
+			  an isolated CPU, the hardware completion
+			  interrupt is handled entirely by a housekeeping
+			  CPU.
+
 			The format of <cpu-list> is described above.
 
 	iucv=		[HW,NET]
-- 
2.51.0

