From: Aaron Tomlin <atomlin@atomlin.com>
To: axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
mst@redhat.com
Cc: atomlin@atomlin.com, aacraid@microsemi.com,
James.Bottomley@HansenPartnership.com,
martin.petersen@oracle.com, liyihang9@h-partners.com,
kashyap.desai@broadcom.com, sumit.saxena@broadcom.com,
shivasharan.srikanteshwara@broadcom.com,
chandrakanth.patil@broadcom.com, sathya.prakash@broadcom.com,
sreekanth.reddy@broadcom.com,
suganath-prabu.subramani@broadcom.com, ranjan.kumar@broadcom.com,
jinpu.wang@cloud.ionos.com, tglx@kernel.org, mingo@redhat.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, akpm@linux-foundation.org,
maz@kernel.org, ruanjinjie@huawei.com, bigeasy@linutronix.de,
yphbchou0911@gmail.com, wagi@kernel.org, frederic@kernel.org,
longman@redhat.com, chenridong@huawei.com, hare@suse.de,
kch@nvidia.com, ming.lei@redhat.com, tom.leiming@gmail.com,
steve@abita.co, sean@ashe.io, chjohnst@gmail.com, neelx@suse.com,
mproche@gmail.com, nick.lange@gmail.com,
marco.crivellari@suse.com, rishil1999@outlook.com,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v13 0/8] blk: honor isolcpus configuration
Date: Tue, 12 May 2026 20:55:01 -0400
Message-ID: <20260513005509.135966-1-atomlin@atomlin.com>
Hi,
I have decided to drive this series forward on behalf of Daniel Wagner, the
original author. The series has been rebased on v7.1-rc2-593-g1d5dcaa3bd65.
This series introduces a new CPU isolation feature, "isolcpus=io_queue",
designed to protect isolated cores from the disruptive hardware interrupts
generated by high-performance multi-queue devices.
When enabled, it fundamentally alters how the generic IRQ subsystem and the
block layer (blk-mq) map hardware queues:
  1. Restricted IRQ Affinity: Managed hardware interrupts are strictly
     confined to online housekeeping CPUs.
  2. Transparent I/O Submission: Applications running on isolated CPUs
     can still seamlessly submit I/O requests; however, the resulting
     hardware completion interrupts are safely routed to a designated
     housekeeping CPU.
  3. Topology-Aware Queue Allocation: The generic CPU-to-hardware-queue
     mapping logic is extended to distribute hardware contexts evenly
     among the available housekeeping CPUs, preventing MSI-X vector
     exhaustion while maintaining optimal cache locality where possible.
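To make the three points above concrete, here is a minimal user-space
sketch of the idea. It is not the code from this series: map_queues_hk()
and mask_weight() are hypothetical names, CPU masks are modelled as plain
64-bit integers, and the sketch ignores NUMA and cache topology entirely.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64

/* Number of set bits in a 64-bit CPU mask. */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Hypothetical helper: fill mq_map[cpu] with a queue index for every CPU
 * set in 'online'. Housekeeping CPUs are dealt out round-robin over at
 * most nr_queues queues; isolated CPUs are then spread over those same
 * queues. Returns the number of queues actually used.
 */
static unsigned int map_queues_hk(uint64_t online, uint64_t hk,
				  unsigned int nr_queues, unsigned int *mq_map)
{
	uint64_t hk_online = online & hk;
	unsigned int used, next = 0, cpu;

	assert(hk_online && nr_queues);
	used = mask_weight(hk_online);
	if (used > nr_queues)
		used = nr_queues;

	/* Pass 1: housekeeping CPUs own the queues and their interrupts. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (hk_online & (1ULL << cpu))
			mq_map[cpu] = next++ % used;

	/* Pass 2: isolated CPUs submit through an already assigned queue. */
	next = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (online & ~hk & (1ULL << cpu))
			mq_map[cpu] = next++ % used;

	return used;
}
```

With CPUs 0-7 online and only CPUs 0-1 marked as housekeeping, the sketch
uses two queues even if the device offers eight, and every isolated CPU
ends up on a queue whose interrupt would target CPU 0 or CPU 1.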
To prevent I/O stalls, the block layer is additionally hardened to reject
hot-plug requests that attempt to offline a housekeeping CPU if it is the
last remaining CPU actively serving an online isolated core.
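The hot-plug rule can be pictured with the following user-space sketch.
It is only an approximation under stated assumptions: can_offline_cpu()
is a hypothetical name, masks are 64-bit integers, and "serving" is
reduced to "maps to the same queue index in mq_map[]".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64

/*
 * Hypothetical check: may 'cpu' go offline? Offlining is refused only when
 * 'cpu' is the last online housekeeping CPU on its queue while an online
 * isolated CPU still maps to that queue.
 */
static bool can_offline_cpu(unsigned int cpu, uint64_t online, uint64_t hk,
			    const unsigned int *mq_map)
{
	bool other_hk = false, isolated_user = false;
	unsigned int i;

	assert(cpu < NR_CPUS);

	/* Isolated CPUs do not service these interrupts; always allowed. */
	if (!(hk & (1ULL << cpu)))
		return true;

	for (i = 0; i < NR_CPUS; i++) {
		if (i == cpu || !(online & (1ULL << i)))
			continue;
		if (mq_map[i] != mq_map[cpu])
			continue;
		if (hk & (1ULL << i))
			other_hk = true;	/* another hk CPU can take over */
		else
			isolated_user = true;	/* an isolated CPU depends on it */
	}

	return other_hk || !isolated_user;
}
```

In this model a housekeeping CPU can be offlined again as soon as the
isolated CPUs sharing its queue have gone offline themselves.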
This iteration abandons the complex "top-down" mask plumbing introduced in
v12, which modified struct irq_affinity and expanded block layer APIs, in
favour of centralised, direct isolation querying via
housekeeping_cpumask(HK_TYPE_IO_QUEUE) within the genirq/affinity
subsystem. This simplification decouples the core changes from
driver-specific implementations, allowing us to drop the virtio enablement
and API modification patches (v12 patches 4, 5, 7, 8, and 9).
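As a rough illustration of what "direct querying" means, the sketch below
filters a set of per-vector affinity masks against a housekeeping mask in
one place, so callers need no extra parameters. This is a user-space model,
not the patch itself: filter_affinity_hk() is a hypothetical name, the hk
argument stands in for the result of
housekeeping_cpumask(HK_TYPE_IO_QUEUE), and falling back to the full
housekeeping mask when the intersection is empty is my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical filter: restrict every vector's affinity mask to the
 * housekeeping CPUs. 'hk' models the housekeeping mask and must not be
 * empty.
 */
static void filter_affinity_hk(uint64_t *masks, size_t nvecs, uint64_t hk)
{
	size_t i;

	assert(hk != 0);

	for (i = 0; i < nvecs; i++) {
		uint64_t m = masks[i] & hk;

		/* Assumed fallback: never leave a vector without a target. */
		masks[i] = m ? m : hk;
	}
}
```

Because the filter lives next to the spreading logic, a driver in this
model keeps calling the same interface as before and simply receives masks
that contain no isolated CPUs.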
Please let me know your thoughts.
Changes since v12:
- Resolved TOCTOU race conditions against CPU hotplug events in
blk_mq_map_queues() and group_mask_cpus_evenly() by taking lockless
snapshots of the online CPU mask prior to algorithmic evaluation.
- Migrated the active_hctx tracking to a dynamically sized bitmap
(bitmap_zalloc), resolving a critical out-of-bounds memory write that
occurred when hardware queues exceeded the system CPU count.
- Wrapped the disk pointer fetch in blk_mq_hctx_can_offline_hk_cpu() with
READ_ONCE() to prevent a TOCTOU NULL pointer dereference against
concurrent device teardowns.
- Introduced bitmap_empty() checks to prevent the mapping logic from
routing unassigned CPUs into unallocated memory when all mapped CPUs are
offline, safely forcing a fallback mapping instead.
- Implemented a native two-stage distribution logic in
group_mask_cpus_evenly() that first prioritises physically present CPUs
to prevent I/O starvation before distributing remaining vectors to
non-present CPUs for hotplug safety.
- Restricted the maximum number of allocated vectors in
irq_calc_affinity_vectors() to the weight of the housekeeping mask,
preventing drivers from wasting memory on dead hardware queues that
physically cannot be routed.
- Added padding logic using irq_default_affinity for sets where isolation
constraints yield fewer masks than requested vectors, preserving the 1:1
hardware queue mapping sequence for subsequent sets.
- Fixed a logic flaw that prematurely rejected valid offline requests. The
check now iterates over cpu_online_mask and reverse-maps each CPU to
detect isolated CPUs accurately, so that non-housekeeping CPUs can be
offlined as expected.
- Corrected an absolute versus relative queue index calculation bug in
blk_mq_map_queues() that was overwriting loop iterations, by iterating
directly over the generated masks.
- Replaced scoped __free cleanups with traditional goto unwinding in the
block layer to align with subsystem styling guidelines.
- Refined the io_queue kernel command-line parameter documentation for
better clarity and precision.
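The two-stage distribution mentioned in the list above can be sketched in
user space as follows. This is a deliberately simplified model and not the
implementation of group_mask_cpus_evenly(): group_two_stage() is a
hypothetical name, masks are 64-bit integers, and NUMA awareness is left
out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS 64

/*
 * Hypothetical two-stage grouping: spread the CPUs in 'mask' over 'ngrps'
 * groups. Present CPUs are dealt out first; non-present CPUs then continue
 * from the next group, wrapping around once every group is in use.
 * Returns the number of non-empty groups.
 */
static unsigned int group_two_stage(uint64_t mask, uint64_t present,
				    unsigned int ngrps, uint64_t *grps)
{
	uint64_t stage[2] = { mask & present, mask & ~present };
	unsigned int next = 0, used = 0, s, cpu;

	assert(ngrps > 0);
	memset(grps, 0, ngrps * sizeof(*grps));

	for (s = 0; s < 2; s++) {
		for (cpu = 0; cpu < NR_CPUS; cpu++) {
			if (!(stage[s] & (1ULL << cpu)))
				continue;
			if (!grps[next])
				used++;		/* first CPU in this group */
			grps[next] |= 1ULL << cpu;
			next = (next + 1) % ngrps;
		}
	}

	return used;
}
```

Dealing out present CPUs first means that, whenever there are at least as
many present CPUs as groups, no group consists solely of CPUs that are not
there yet.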
Changes since v11:
- Removed duplicate paragraph from the commit message in patch 11
(Marco Crivellari)
- Ensured ZERO_SIZE_PTR is not returned by group_mask_cpus_evenly()
(Marco Crivellari)
- Linked to v11: https://lore.kernel.org/lkml/20260416192942.1243421-1-atomlin@atomlin.com/
Changes since v10:
- Completely rewrote the isolcpus=io_queue documentation in
Documentation/admin-guide/kernel-parameters.txt to clarify its exclusive
application to managed IRQs, queue allocation limits, vector exhaustion
prevention, and hardware interrupt routing (Ming Lei)
- Fixed a stack frame bloat issue by avoiding the on-stack declaration of
struct cpumask (Waiman Long)
- Linked to v10: https://lore.kernel.org/linux-nvme/20260401222312.772334-1-atomlin@atomlin.com/
Changes since v9:
- Fixed a page fault regression encountered when initialising secondary
queue maps (e.g., NVMe poll queues). Restored the qmap->queue_offset to
the mq_map assignment to ensure CPUs are strictly mapped to absolute
hardware indices (Keith Busch)
- Corrected the active_hctx tracker to utilise relative queue indices,
preventing out-of-bounds mask assignments
- Fixed the blk_mq_validate() sanity check to properly evaluate absolute
queue indices against the offset-adjusted loop index
- Corrected typographical errors within block/blk-mq-cpumap.c
(Keith Busch)
- Clarified the commit message regarding the removal of the !SMP fallback
code, explicitly noting that the core scheduler now mandates SMP
unconditionally (Sebastian Andrzej Siewior)
- Added missing "Signed-off-by:" tags to properly record the patch series
chain of custody
- Linked to v9: https://lore.kernel.org/lkml/20260330221047.630206-1-atomlin@atomlin.com/
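The difference between relative and absolute queue indices in the list
above is easy to get wrong, so here is a small user-space sketch of it.
struct queue_map is a trimmed-down stand-in for the kernel's
struct blk_mq_queue_map, map_with_offset() is a hypothetical name, and the
round-robin assignment is only a placeholder for the real mapping logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Trimmed-down, hypothetical stand-in for struct blk_mq_queue_map. */
struct queue_map {
	unsigned int *mq_map;		/* per-CPU absolute queue index */
	unsigned int nr_queues;		/* number of queues in this map */
	unsigned int queue_offset;	/* first hardware queue of this map */
};

/*
 * Assign CPUs to queues round-robin. The tracker is indexed by the
 * relative queue number and sized by nr_queues, while mq_map[] stores the
 * absolute index. Returns the number of queues in use, or -1 if the
 * tracker cannot be allocated.
 */
static int map_with_offset(struct queue_map *qmap, unsigned int nr_cpus)
{
	unsigned int cpu, q;
	bool *active;
	int used = 0;

	assert(qmap->nr_queues > 0);

	active = calloc(qmap->nr_queues, sizeof(*active));
	if (!active)
		return -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		q = cpu % qmap->nr_queues;		/* relative index */
		qmap->mq_map[cpu] = qmap->queue_offset + q; /* absolute */
		if (!active[q]) {
			active[q] = true;
			used++;
		}
	}

	free(active);
	return used;
}
```

Sizing the tracker by the queue count rather than the CPU count keeps the
writes in bounds even when a map has more queues than there are CPUs.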
Changes since v8:
- Added "Reviewed-by:" tags
- Introduced irq_spread_hk_filter() to safely restrict managed IRQ
affinity to housekeeping CPUs (Thomas Gleixner)
- Removed the unsafe global static variable blk_hk_online_mask from
blk-mq-cpumap.c and blk-mq.c. blk_mq_online_queue_affinity() now returns
a stable pointer, delegating safe intersection to the callers to prevent
concurrent modification races (Thomas Gleixner, Hannes Reinecke)
- Resolved the "BUG: kernel NULL pointer dereference" in
__blk_mq_all_tag_iter() reported by the kernel test robot during
cpuhotplug rcutorture stress testing
- Linked to v8: https://lore.kernel.org/lkml/20250905-isolcpus-io-queues-v8-0-885984c5daca@kernel.org/
Changes since v7:
- Added commit 524f5eea4bbe ("lib/group_cpus: remove !SMP code")
- Merged the new mapping logic directly into the existing function to
avoid special casing
- Refined the group_mask_cpus_evenly() implementation with the following
updates:
- Corrected the function name typo (changed group_masks_cpus_evenly to
group_mask_cpus_evenly)
- Updated the documentation comment to accurately reflect the function's
behaviour
- Renamed the cpu_mask argument to mask for consistency
- Added a new patch for aacraid to include the missing number of queues
calculation
- Restricted updates to only affect SCSI drivers that support
PCI_IRQ_AFFINITY and do not utilise nvme-fabrics
- Removed the __free cleanup attribute usage for cpumask_var_t allocations
due to compatibility issues
- Updated the documentation to explicitly highlight the limitations
surrounding CPU offlining
- Collected accumulated Reviewed-by and Acked-by tags
- Linked to v7: https://patch.msgid.link/20250702-isolcpus-io-queues-v7-0-557aa7eacce4@kernel.org
Changes since v6:
- Sent out the first part of the series independently:
https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@kernel.org/
- Added comprehensive kernel command-line documentation
- Added validation logic to ensure the resulting CPU-to-queue mapping is
fully operational
- Rewrote the isolcpus mapping code to properly account for active
hardware contexts (hctx)
- Introduced blk_mq_map_hk_irq_queues(), which uses the mask retrieved
from irq_get_affinity()
- Refactored blk_mq_map_hk_queues to require the caller to explicitly test
for HK_TYPE_MANAGED_IRQ
- Linked to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org
Changes since v5:
- Reintroduced the io_queue type for the isolcpus kernel parameter
- Prevented the offlining of a housekeeping CPU if an isolated CPU is
still present, upgrading this behaviour from a simple warning to a hard
restriction
- Linked to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org
Changes since v4:
- Rebased the series onto the latest for-6.14/block branch
- Updated the documentation regarding the managed_irq parameters
- Reworded the commit message for "blk-mq: issue warning when offlining
hctx with online isolcpus" for better clarity
- Split the input and output parameters in the patch "lib/group_cpus: let
group_cpu_evenly return number of groups"
- Dropped the patch "sched/isolation: document HK_TYPE housekeeping
option"
- Linked to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org
Changes since v3:
- Added the patch "blk-mq: issue warning when offlining hctx with online
isolcpus"
- Fixed the check in group_cpus_evenly(); the condition now properly uses
housekeeping_enabled() instead of cpumask_weight(), as the housekeeping
mask is never empty and its weight is therefore always non-zero
- Dropped the Fixes: tag from "lib/group_cpus.c: honor housekeeping config
when grouping CPUs"
- Fixed an overlong line warning in the patch "scsi: use block layer
helpers to calculate num of queues"
- Dropped the patch "sched/isolation: Add io_queue housekeeping option" in
favour of simply documenting the housekeeping hk_type enum
- Added the patch "lib/group_cpus: let group_cpu_evenly return number of
groups"
- Collected accumulated Reviewed-by and Acked-by tags
- Split the patchset by moving foundational changes into a separate
preparation series:
https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Linked to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de
Changes since v2:
- Integrated patches from Ming Lei
(https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/):
"virtio: add APIs for retrieving vq affinity" and "blk-mq: introduce
blk_mq_dev_map_queues"
- Replaced all instances of blk_mq_pci_map_queues and
blk_mq_virtio_map_queues with the new unified blk_mq_dev_map_queues
- Updated and expanded the helper functions used for calculating the
number of queues
- Added the CPU-to-hctx mapping function specifically to support the
isolcpus=io_queue parameter
- Documented the hk_type enum and the newly introduced isolcpus=io_queue
parameter
- Added the patch "scsi: pm8001: do not overwrite PCI queue mapping"
- Linked to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de
Changes since v1:
- Updated the feature documentation for clarity and completeness
- Split the blk/nvme-pci patch into smaller, logical commits
- Dropped the HK_TYPE_IO_QUEUE macro in favour of reusing
HK_TYPE_MANAGED_IRQ
- Linked to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de
Aaron Tomlin (1):
  genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs

Daniel Wagner (7):
  scsi: aacraid: use block layer helpers to calculate num of queues
  lib/group_cpus: remove dead !SMP code
  lib/group_cpus: Add group_mask_cpus_evenly()
  isolation: Introduce io_queue isolcpus type
  blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  docs: add io_queue flag to isolcpus

 .../admin-guide/kernel-parameters.txt |  30 ++-
 block/blk-mq-cpumap.c                 | 224 ++++++++++++++++--
 block/blk-mq.c                        |  56 +++++
 drivers/scsi/aacraid/comminit.c       |   3 +-
 include/linux/group_cpus.h            |   3 +
 include/linux/sched/isolation.h       |   1 +
 kernel/irq/affinity.c                 |  35 ++-
 kernel/sched/isolation.c              |   7 +
 lib/group_cpus.c                      | 108 ++++++++-
 9 files changed, 427 insertions(+), 40 deletions(-)

base-commit: 1d5dcaa3bd65f2e8c9baa14a393d3a2dc5db7524
--
2.51.0