Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type


From: Ming Lei <ming.lei@redhat.com>
To: Aaron Tomlin <atomlin@atomlin.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
	mst@redhat.com, aacraid@microsemi.com,
	James.Bottomley@hansenpartnership.com,
	martin.petersen@oracle.com, liyihang9@h-partners.com,
	kashyap.desai@broadcom.com, sumit.saxena@broadcom.com,
	shivasharan.srikanteshwara@broadcom.com,
	chandrakanth.patil@broadcom.com, sathya.prakash@broadcom.com,
	sreekanth.reddy@broadcom.com,
	suganath-prabu.subramani@broadcom.com, ranjan.kumar@broadcom.com,
	jinpu.wang@cloud.ionos.com, tglx@kernel.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, akpm@linux-foundation.org,
	maz@kernel.org, ruanjinjie@huawei.com, yphbchou0911@gmail.com,
	wagi@kernel.org, frederic@kernel.org, longman@redhat.com,
	chenridong@huawei.com, hare@suse.de, kch@nvidia.com,
	steve@abita.co, sean@ashe.io, chjohnst@gmail.com, neelx@suse.com,
	mproche@gmail.com, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
	megaraidlinux.pdl@broadcom.com, mpi3mr-linuxdrv.pdl@broadcom.com,
	MPT-FusionLinux.pdl@broadcom.com
Subject: Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
Date: Fri, 3 Apr 2026 09:20:17 +0800
Message-ID: <ac8V0RO-yg3juOox@fedora>
In-Reply-To: <sluplntvagevh6ehfm3kqinbh23d2gnin7stkptxk6drvogh2g@4hpz74fidrq2>

On Thu, Apr 02, 2026 at 08:50:55PM -0400, Aaron Tomlin wrote:
> On Thu, Apr 02, 2026 at 11:09:40AM +0200, Sebastian Andrzej Siewior wrote:
> > On 2026-04-01 16:58:22 [-0400], Aaron Tomlin wrote:
> > > Hi Sebastian,
> > Hi,
> > 
> > > Thank you for taking the time to document the "managed_irq" behaviour; it
> > > is immensely helpful. You raise a highly pertinent point regarding the
> > > potential proliferation of "isolcpus=" flags. It is certainly a situation
> > > that must be managed carefully to prevent every subsystem from demanding
> > > its own bit.
> > > 
> > > To clarify the reasoning behind introducing "io_queue" rather than strictly
> > > relying on managed_irq:
> > > 
> > > The managed_irq flag belongs firmly to the interrupt subsystem. It dictates
> > > whether a CPU is eligible to receive hardware interrupts whose affinity is
> > > managed by the kernel. Whilst many modern block drivers use managed IRQs,
> > > the block layer multi-queue mapping encompasses far more than just
> > > interrupt routing. It maps logical queues to CPUs to handle I/O submission,
> > > software queues, and crucially, poll queues, which do not utilise
> > > interrupts at all. Furthermore, there are specific drivers that do not use
> > > the managed IRQ infrastructure but still rely on the block layer for queue
> > > distribution.
> > 
> > Could you tell block which queue maps to which CPU at /sys/block/$$/mq/
> > level? Then you have one queue going to one CPU.
> > Then the driver could request one or more interrupts managed or not. For
> > managed you could specify a CPU mask which you desire to occupy.
> > You have the case where
> > - you have more queues than CPUs
> >   - use all of them
> >   - use less
> > - less queues than CPUs
> >   - mapped a queue to more than one CPU in case it goes down or becomes
> >     not available
> >   - mapped to one CPU
> > 
> > Ideally you solve this at one level so that the device(s) can request
> > less queues than CPUs if told so without patching each and every driver.
> > 
> > This should give you the freedom to isolate CPUs, decide at boot time
> > which CPUs get I/O queues assigned. At run time you can tell which
> > queues go to which CPUs. If you shutdown a queue, the interrupt remains
> > but does not get any I/O requests assigned so no problem. If the CPU
> > goes down, same thing.
> > 
> > I am trying to come up with a design here which I haven't found so far.
> > But I might be late to the party and everyone else is fully aware.
> > 
> > > If managed_irq were solely relied upon, the IRQ subsystem would
> > > successfully keep hardware interrupts off the isolated CPUs, but the block
> > 
> > The managed_irqs can't be influenced by userland. The CPUs are auto
> > distributed.
> > 
> > > layer would still blindly map polling queues or non-managed queues to those
> > > same isolated CPUs. This would force isolated CPUs to process I/O
> > > submissions or handle polling tasks, thereby breaking the strict isolation.
> > > 
> > > Regarding the point about the networking subsystem, it is a very valid
> > > comparison. If the networking layer wishes to respect isolcpus in the
> > > future, adding a net flag would indeed exacerbate the bit proliferation.
> > 
> > Networking could also have different cases like adding a RX filter and
> > having HW putting packet based on it in a dedicated queue. But also in
> > this case I would like to have the freedom to decide which isolated
> > CPUs should receive interrupts/ traffic and which don't.
> > 
> > > For the present time, retaining io_queue seems the most prudent approach to
> > > ensure that block queue mapping remains semantically distinct from
> > > interrupt delivery. This provides an immediate and clean architectural
> > > boundary. However, if the consensus amongst the maintainers suggests that
> > > this is too granular, alternative approaches could certainly be considered
> > > for the future. For instance, a broader, more generic flag could be
> > > introduced to encompass both block and future networking queue mappings.
> > > Alternatively, if semantic conflation is deemed acceptable, the existing
> > > managed_irq housekeeping mask could simply be overloaded within the block
> > > layer to restrict all queue mappings.
> > > 
> > > Keeping the current separation appears to be the cleanest solution for this
> > > series, but your thoughts, and those of the wider community, on potentially
> > > migrating to a consolidated generic flag in the future would be very much
> > > welcomed.
> > 
> > I just don't like introducing yet another boot argument, making it a
> > boot constraint while in my naive view this could be managed at some
> > degree via sysfs as suggested above.
> 
> Hi Sebastian,
> 
> I believe it would be more prudent to defer to Thomas Gleixner and Jens
> Axboe on this matter.
> 
> 
> Indeed, I am entirely sympathetic to your reluctance to introduce yet
> another boot parameter, and I concur that run-time configurability
> represents the ideal scenario for system tuning.

`io_queue` introduces the cost of a potential failure when offlining a CPU,
so how can it replace the existing `managed_irq`?
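
To make that cost concrete, here is a rough userspace sketch in Python,
not kernel code. It assumes the rule suggested by the title of patch 11/13
("prevent offlining hk CPUs with associated online isolated CPUs"): a
housekeeping CPU may not go offline while an online isolated CPU still
relies on its queue. The function name `can_offline` and the data layout
are hypothetical and exist only for illustration.

```python
def can_offline(cpu, queue_map, housekeeping, online):
    """Return True if `cpu` may go offline under the assumed io_queue rule.

    queue_map:    dict mapping CPU id -> hardware queue id (hypothetical)
    housekeeping: set of housekeeping CPU ids
    online:       set of currently online CPU ids
    """
    if cpu not in housekeeping:
        # Assumption: offlining an isolated CPU is never restricted.
        return True

    queue = queue_map[cpu]
    # Other online CPUs that submit I/O through the same queue.
    sharers = {c for c in online if c != cpu and queue_map[c] == queue}

    if sharers & housekeeping:
        # Another housekeeping CPU can keep servicing the queue.
        return True

    # Only isolated CPUs (or nobody) would be left on this queue.
    return not sharers


# Four CPUs, two queues; CPUs 0-1 are housekeeping, CPUs 2-3 are isolated.
queue_map = {0: 0, 2: 0, 1: 1, 3: 1}
housekeeping = {0, 1}

print(can_offline(0, queue_map, housekeeping, online={0, 1, 2, 3}))
print(can_offline(0, queue_map, housekeeping, online={0, 1, 3}))
```

In this sketch the first call is refused because isolated CPU 2 is still
online and shares queue 0 with CPU 0; once CPU 2 is offline the second call
succeeds. `managed_irq` alone has no such failure path.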

> 
> At present, a device such as an NVMe controller allocates its hardware
> queues and requests its interrupt vectors during the initial device probe
> phase. The block layer calculates the optimal queue to CPU mapping based on
> the system topology at that precise moment. Altering this mapping
> dynamically at runtime via sysfs would be an exceptionally intricate
> undertaking. It would necessitate freezing all active operations, tearing
> down the physical hardware queues on the device, renegotiating the
> interrupt vectors with the peripheral component interconnect subsystem, and
> finally reconstructing the entire queue map.
> 
> Furthermore, the proposed io_queue boot parameter successfully achieves the
> objective of avoiding driver level modifications. By applying the
> housekeeping mask constraint centrally within the core block layer mapping
> helpers, all multiqueue drivers automatically inherit the CPU isolation
> boundaries without requiring a single line of code to be changed within the
> individual drivers themselves.
> 
> Because the hardware queue count and CPU alignment must be calculated as
> the device initialises, a reliable mechanism is required to inform the
> block layer of which CPUs are strictly isolated before the probe sequence
> commences. This is precisely why integrating with the existing boot time
> housekeeping infrastructure is currently the most viable and robust
> solution.
> 
> Whilst a fully dynamic sysfs driven reconfiguration architecture would be
> great, it would represent a substantial paradigm shift for the block layer.
> For the present time, the io_queue flag resolves the immediate and severe
> latency issues experienced by users with isolated CPUs, employing an
> established and safe methodology.

I'd suggest documenting the exact existing problem, because `managed_irq`
should already cover it on a best-effort basis, so that people know how to
choose between the two parameters.
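
As a starting point for such documentation, the difference could be
illustrated with a small userspace sketch in Python. It is not the kernel's
algorithm: the round-robin spreading ignores topology, and the helper names
`spread`, `irq_targets` and `constrained_spread` are hypothetical. The only
behaviour taken as given is the documented best-effort rule of
`managed_irq`: an interrupt is steered to a housekeeping CPU only if the
queue's affinity mask contains one.

```python
def spread(cpus, nr_queues):
    """Round-robin CPUs over queues; returns one CPU set per queue."""
    queues = [set() for _ in range(nr_queues)]
    for i, cpu in enumerate(sorted(cpus)):
        queues[i % nr_queues].add(cpu)
    return queues


def irq_targets(queue_mask, housekeeping):
    """Best-effort rule: prefer housekeeping CPUs inside the queue's mask."""
    preferred = queue_mask & housekeeping
    return preferred if preferred else set(queue_mask)


def constrained_spread(cpus, nr_queues, housekeeping):
    """Assumed io_queue-style map: at most one queue per housekeeping CPU,
    with isolated CPUs attached to those queues afterwards."""
    hk = cpus & housekeeping
    if not hk:
        raise ValueError("need at least one housekeeping CPU")
    nr = min(nr_queues, len(hk))
    queues = spread(hk, nr)
    for i, cpu in enumerate(sorted(cpus - hk)):
        queues[i % nr].add(cpu)
    return queues


cpus = set(range(8))
housekeeping = {0, 1}

# managed_irq only: 8 queues, one CPU each, so most masks hold no
# housekeeping CPU and the interrupt must land on an isolated CPU.
plain = spread(cpus, 8)
leaky = [q for q in plain if not irq_targets(q, housekeeping) <= housekeeping]

# io_queue-style: queue count limited to the housekeeping CPUs, so every
# mask contains a housekeeping CPU for the interrupt to land on.
limited = constrained_spread(cpus, 8, housekeeping)
clean = all(irq_targets(q, housekeeping) <= housekeeping for q in limited)

print(len(leaky), limited, clean)
```

Under these assumptions, `managed_irq` helps only when a queue's mask
happens to mix isolated and housekeeping CPUs, while the constrained map
guarantees it at the price of fewer queues.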


Thanks,
Ming


Thread overview: 29+ messages
2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 01/13] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code Aaron Tomlin
2026-04-01 12:29   ` Sebastian Andrzej Siewior
2026-04-01 19:31     ` Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 03/13] lib/group_cpus: Add group_mask_cpus_evenly() Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 04/13] genirq/affinity: Add cpumask to struct irq_affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 05/13] blk-mq: add blk_mq_{online|possible}_queue_affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 06/13] nvme-pci: use block layer helpers to constrain queue affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 07/13] scsi: Use " Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 08/13] virtio: blk/scsi: use " Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type Aaron Tomlin
2026-04-01 12:49   ` Sebastian Andrzej Siewior
2026-04-01 19:05     ` Waiman Long
2026-04-02  7:58       ` Sebastian Andrzej Siewior
2026-04-03  1:54         ` Waiman Long
2026-04-01 20:58     ` Aaron Tomlin
2026-04-02  9:09       ` Sebastian Andrzej Siewior
2026-04-03  0:50         ` Aaron Tomlin
2026-04-03  1:20           ` Ming Lei [this message]
2026-03-30 22:10 ` [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
2026-03-31 23:05   ` Keith Busch
2026-04-01 17:16     ` Aaron Tomlin
2026-04-03  1:43   ` Ming Lei
2026-03-30 22:10 ` [PATCH v9 11/13] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 12/13] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 13/13] docs: add io_queue flag to isolcpus Aaron Tomlin
2026-03-31  1:01 ` [PATCH v9 00/13] blk: honor isolcpus configuration Ming Lei
2026-03-31 13:38   ` Aaron Tomlin
