From: Aaron Tomlin <atomlin@atomlin.com>
To: axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
mst@redhat.com
Cc: atomlin@atomlin.com, aacraid@microsemi.com,
James.Bottomley@HansenPartnership.com,
martin.petersen@oracle.com, liyihang9@h-partners.com,
kashyap.desai@broadcom.com, sumit.saxena@broadcom.com,
shivasharan.srikanteshwara@broadcom.com,
chandrakanth.patil@broadcom.com, sathya.prakash@broadcom.com,
sreekanth.reddy@broadcom.com,
suganath-prabu.subramani@broadcom.com, ranjan.kumar@broadcom.com,
jinpu.wang@cloud.ionos.com, tglx@kernel.org, mingo@redhat.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, akpm@linux-foundation.org,
maz@kernel.org, ruanjinjie@huawei.com, bigeasy@linutronix.de,
yphbchou0911@gmail.com, wagi@kernel.org, frederic@kernel.org,
longman@redhat.com, chenridong@huawei.com, hare@suse.de,
kch@nvidia.com, ming.lei@redhat.com, tom.leiming@gmail.com,
steve@abita.co, sean@ashe.io, chjohnst@gmail.com, neelx@suse.com,
mproche@gmail.com, nick.lange@gmail.com,
marco.crivellari@suse.com, rishil1999@outlook.com,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v13 0/8] blk: honor isolcpus configuration
Date: Tue, 12 May 2026 20:55:01 -0400
Message-ID: <20260513005509.135966-1-atomlin@atomlin.com>
Hi,
I have decided to drive this series forward on behalf of Daniel Wagner, the
original author. The series has been rebased on v7.1-rc2-593-g1d5dcaa3bd65.
This series introduces a new CPU isolation feature, "isolcpus=io_queue",
designed to protect isolated cores from the disruptive hardware interrupts
generated by high-performance multi-queue devices.
When enabled, it fundamentally alters how the generic IRQ subsystem and the
block layer (blk-mq) map hardware queues:
1. Restricted IRQ Affinity: Managed hardware interrupts are strictly
confined to online housekeeping CPUs.
2. Transparent I/O Submission: Applications running on isolated CPUs
can still seamlessly submit I/O requests; however, the resulting
hardware completion interrupts are safely routed to a designated
housekeeping CPU.
3. Topology-Aware Queue Allocation: The generic CPU-to-hardware-queue
mapping logic is extended to distribute hardware contexts evenly
among the available housekeeping CPUs, preventing MSI-X vector
exhaustion while maintaining optimal cache locality where possible.
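The first point above boils down to one mask intersection. A minimal
user-space sketch (CPU masks modelled as 64-bit words, one bit per CPU;
in the kernel the equivalent is cpumask_and() against
housekeeping_cpumask(HK_TYPE_IO_QUEUE), and the function name here is
illustrative, not the kernel's):

```c
#include <stdint.h>

/* Confine a managed vector's affinity to housekeeping CPUs.  If the
 * intersection is empty the vector would be unservable, so fall back
 * to the full housekeeping mask instead. */
static uint64_t restrict_to_housekeeping(uint64_t irq_mask, uint64_t hk_mask)
{
	uint64_t m = irq_mask & hk_mask;

	return m ? m : hk_mask;
}
```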
To prevent I/O stalls, the block layer is additionally hardened to reject
hot-plug requests that attempt to offline a housekeeping CPU if it is the
last remaining CPU actively serving an online isolated core.
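The hot-plug guard can be modelled the same way. This is a toy version
under assumed semantics (one bit per CPU; names are illustrative): a
housekeeping CPU may go offline unless it is the last online housekeeping
CPU of a hardware context that still has online isolated CPUs mapped to
it.

```c
#include <stdbool.h>
#include <stdint.h>

static bool can_offline_cpu(unsigned int cpu, uint64_t hctx_mask,
			    uint64_t online_mask, uint64_t hk_mask)
{
	uint64_t online_hk = hctx_mask & online_mask & hk_mask;
	uint64_t isolated  = hctx_mask & online_mask & ~hk_mask;

	/* Offlining an isolated CPU is always permitted. */
	if (!(hk_mask & (1ULL << cpu)))
		return true;

	/* A housekeeping CPU may go away unless it is the last one
	 * while isolated CPUs still depend on this hardware context. */
	return (online_hk & ~(1ULL << cpu)) != 0 || isolated == 0;
}
```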
This iteration abandons the complex "top-down" mask plumbing introduced in
v12, which modified struct irq_affinity and expanded block layer APIs, in
favour of centralised, direct isolation querying via
housekeeping_cpumask(HK_TYPE_IO_QUEUE) within the genirq/affinity
subsystem. This simplification decouples the core changes from
driver-specific implementations, allowing us to drop the virtio enablement
and API modification patches (v12 patches 4, 5, 7, 8, and 9).
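The shape of that simplification, in a hypothetical user-space sketch
(hk_io_queue_mask() stands in for housekeeping_cpumask(HK_TYPE_IO_QUEUE);
both names below are illustrative): the spreading code queries isolation
state at the single point where vector masks are built, rather than every
driver passing masks down through the APIs.

```c
#include <stdint.h>

/* Illustrative static configuration: CPUs 0-1 are housekeeping. */
static uint64_t hk_io_queue_mask(void)
{
	return 0x3;
}

/* One central intersection replaces the per-driver mask plumbing. */
static uint64_t build_vector_mask(uint64_t candidate_cpus)
{
	return candidate_cpus & hk_io_queue_mask();
}
```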
Please let me know your thoughts.
Changes since v12:
- Resolved TOCTOU race conditions against CPU hotplug events in
blk_mq_map_queues() and group_mask_cpus_evenly() by taking lockless
snapshots of the online CPU mask prior to algorithmic evaluation.
- Migrated the active_hctx tracking to a dynamically sized bitmap
(bitmap_zalloc), resolving a critical out-of-bounds memory write that
occurred when hardware queues exceeded the system CPU count.
- Wrapped the disk pointer fetch in blk_mq_hctx_can_offline_hk_cpu() with
READ_ONCE() to prevent a TOCTOU NULL pointer dereference against
concurrent device teardowns.
- Introduced bitmap_empty() checks to prevent the mapping logic from
routing unassigned CPUs into unallocated memory when all mapped CPUs are
offline, safely forcing a fallback mapping instead.
- Implemented a native two-stage distribution logic in
group_mask_cpus_evenly() that first prioritises physically present CPUs
to prevent I/O starvation before distributing remaining vectors to
non-present CPUs for hotplug safety.
- Restricted the maximum number of allocated vectors in
irq_calc_affinity_vectors() to the weight of the housekeeping mask,
preventing drivers from wasting memory on dead hardware queues that
physically cannot be routed.
- Added padding logic using irq_default_affinity for sets where isolation
constraints yield fewer masks than requested vectors, preserving the 1:1
hardware queue mapping sequence for subsequent sets.
- Fixed a logic flaw that prematurely rejected valid offline requests:
  the check now manually iterates over cpu_online_mask and reverse-maps
  each CPU to its hardware context to accurately detect isolated CPUs,
  properly permitting the offlining of non-housekeeping CPUs.
- Corrected an absolute versus relative queue index calculation bug in
blk_mq_map_queues() that was overwriting loop iterations, by iterating
directly over the generated masks.
- Replaced scoped __free cleanups with traditional goto unwinding in the
block layer to align with subsystem styling guidelines.
- Refined the io_queue kernel command-line parameter documentation for
better clarity and precision.
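The two-stage distribution described above can be modelled in a few lines
of user-space C (bit-per-CPU masks; the function name and signature are
illustrative, not the kernel's): vectors are handed out round-robin to
physically present CPUs first, and only then to not-yet-present
(possible hot-plug) CPUs, so present CPUs are never starved of a queue.

```c
#include <stdint.h>

/* out_group[cpu] receives the vector index for each covered CPU. */
static void spread_two_stage(uint64_t present, uint64_t possible,
			     unsigned int nvec, int out_group[64])
{
	uint64_t stage[2] = { present, possible & ~present };
	unsigned int v = 0;

	for (int s = 0; s < 2; s++) {
		for (int cpu = 0; cpu < 64; cpu++) {
			if (stage[s] & (1ULL << cpu)) {
				out_group[cpu] = (int)(v % nvec);
				v++;
			}
		}
	}
}
```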
Changes since v11:
- Removed duplicate paragraph from the commit message in patch 11
(Marco Crivellari)
- Ensure ZERO_SIZE_PTR is not returned by group_mask_cpus_evenly()
(Marco Crivellari)
- Linked to v11: https://lore.kernel.org/lkml/20260416192942.1243421-1-atomlin@atomlin.com/
Changes since v10:
- Completely rewrote the isolcpus=io_queue documentation in
Documentation/admin-guide/kernel-parameters.txt to clarify its exclusive
application to managed IRQs, queue allocation limits, vector exhaustion
prevention, and hardware interrupt routing (Ming Lei)
- Fixed a stack frame bloat issue by avoiding the on-stack declaration of
struct cpumask (Waiman Long)
- Linked to v10: https://lore.kernel.org/linux-nvme/20260401222312.772334-1-atomlin@atomlin.com/
Changes since v9:
- Fixed a page fault regression encountered when initialising secondary
queue maps (e.g., NVMe poll queues). Restored the qmap->queue_offset to
the mq_map assignment to ensure CPUs are strictly mapped to absolute
hardware indices (Keith Busch)
- Corrected the active_hctx tracker to utilise relative queue indices,
preventing out-of-bounds mask assignments
- Fixed the blk_mq_validate() sanity check to properly evaluate absolute
queue indices against the offset-adjusted loop index
- Corrected typographical errors within block/blk-mq-cpumap.c
(Keith Busch)
- Clarified the commit message regarding the removal of the !SMP fallback
code, explicitly noting that the core scheduler now mandates SMP
unconditionally (Sebastian Andrzej Siewior)
- Added missing "Signed-off-by:" tags to properly record the patch series
chain of custody
- Linked to v9: https://lore.kernel.org/lkml/20260330221047.630206-1-atomlin@atomlin.com/
Changes since v8:
- Added "Reviewed-by:" tags
- Introduced irq_spread_hk_filter() to safely restrict managed IRQ
affinity to housekeeping CPUs (Thomas Gleixner)
- Removed the unsafe global static variable blk_hk_online_mask from
blk-mq-cpumap.c and blk-mq.c. blk_mq_online_queue_affinity() now returns
a stable pointer, delegating safe intersection to the callers to prevent
concurrent modification races (Thomas Gleixner, Hannes Reinecke)
- Resolved BUG: kernel NULL pointer dereference in __blk_mq_all_tag_iter
reported by the kernel test robot during cpuhotplug rcutorture stress
testing
- Linked to v8: https://lore.kernel.org/lkml/20250905-isolcpus-io-queues-v8-0-885984c5daca@kernel.org/
Changes since v7:
- Added commit 524f5eea4bbe ("lib/group_cpus: remove !SMP code")
- Merged the new mapping logic directly into the existing function to
avoid special casing
- Refined the group_mask_cpus_evenly() implementation with the following
updates:
- Corrected the function name typo (changed group_masks_cpus_evenly to
group_mask_cpus_evenly)
- Updated the documentation comment to accurately reflect the function's
behavior
- Renamed the cpu_mask argument to mask for consistency
- Added a new patch for aacraid to include the missing number of queues
calculation
- Restricted updates to only affect SCSI drivers that support
PCI_IRQ_AFFINITY and do not utilise nvme-fabrics
- Removed the __free cleanup attribute usage for cpumask_var_t allocations
due to compatibility issues
- Updated the documentation to explicitly highlight the limitations
surrounding CPU offlining
- Collected accumulated Reviewed-by and Acked-by tags
- Linked to v7: https://patch.msgid.link/20250702-isolcpus-io-queues-v7-0-557aa7eacce4@kernel.org
Changes since v6:
- Sent out the first part of the series independently:
https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@kernel.org/
- Added comprehensive kernel command-line documentation
- Added validation logic to ensure the resulting CPU-to-queue mapping is
fully operational
- Rewrote the isolcpus mapping code to properly account for active
hardware contexts (hctx)
- Introduced blk_mq_map_hk_irq_queues, which utilizes the mask retrieved
from irq_get_affinity()
- Refactored blk_mq_map_hk_queues to require the caller to explicitly test
for HK_TYPE_MANAGED_IRQ
- Linked to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org
Changes since v5:
- Reintroduced the io_queue type for the isolcpus kernel parameter
- Prevented the offlining of a housekeeping CPU if an isolated CPU is
still present, upgrading this behavior from a simple warning to a hard
restriction
- Linked to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org
Changes since v4:
- Rebased the series onto the latest for-6.14/block branch.
- Updated the documentation regarding the managed_irq parameters
- Reworded the commit message for "blk-mq: issue warning when offlining
hctx with online isolcpus" for better clarity
- Split the input and output parameters in the patch "lib/group_cpus: let
group_cpu_evenly return number of groups"
- Dropped the patch "sched/isolation: document HK_TYPE housekeeping
option"
- Linked to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org
Changes since v3:
- Added the patch "blk-mq: issue warning when offlining hctx with online
isolcpus"
- Fixed the check in group_cpus_evenly(); the condition now properly uses
  housekeeping_enabled() instead of cpumask_weight(), as
  housekeeping_cpumask() always returns a valid mask even when no
  isolation is configured
- Dropped the Fixes: tag from "lib/group_cpus.c: honor housekeeping config
when grouping CPUs"
- Fixed an overlong line warning in the patch "scsi: use block layer
helpers to calculate num of queues"
- Dropped the patch "sched/isolation: Add io_queue housekeeping option" in
favor of simply documenting the housekeeping hk_type enum
- Added the patch "lib/group_cpus: let group_cpu_evenly return number of
groups"
- Collected accumulated Reviewed-by and Acked-by tags
- Split the patchset by moving foundational changes into a separate
preparation series:
https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Linked to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de
Changes since v2:
- Integrated patches from Ming Lei
(https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/):
"virtio: add APIs for retrieving vq affinity" and "blk-mq: introduce
blk_mq_dev_map_queues"
- Replaced all instances of blk_mq_pci_map_queues and
blk_mq_virtio_map_queues with the new unified blk_mq_dev_map_queues
- Updated and expanded the helper functions used for calculating the
number of queues
- Added the CPU-to-hctx mapping function specifically to support the
isolcpus=io_queue parameter
- Documented the hk_type enum and the newly introduced isolcpus=io_queue
parameter
- Added the patch "scsi: pm8001: do not overwrite PCI queue mapping"
- Linked to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de
Changes since v1:
- Updated the feature documentation for clarity and completeness
- Split the blk/nvme-pci patch into smaller, logical commits
- Dropped the HK_TYPE_IO_QUEUE macro in favor of reusing
HK_TYPE_MANAGED_IRQ
- Linked to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de
Aaron Tomlin (1):
genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs
Daniel Wagner (7):
scsi: aacraid: use block layer helpers to calculate num of queues
lib/group_cpus: remove dead !SMP code
lib/group_cpus: Add group_mask_cpus_evenly()
isolation: Introduce io_queue isolcpus type
blk-mq: use hk cpus only when isolcpus=io_queue is enabled
blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
docs: add io_queue flag to isolcpus
.../admin-guide/kernel-parameters.txt | 30 ++-
block/blk-mq-cpumap.c | 224 ++++++++++++++++--
block/blk-mq.c | 56 +++++
drivers/scsi/aacraid/comminit.c | 3 +-
include/linux/group_cpus.h | 3 +
include/linux/sched/isolation.h | 1 +
kernel/irq/affinity.c | 35 ++-
kernel/sched/isolation.c | 7 +
lib/group_cpus.c | 108 ++++++++-
9 files changed, 427 insertions(+), 40 deletions(-)
base-commit: 1d5dcaa3bd65f2e8c9baa14a393d3a2dc5db7524
--
2.51.0