From: Ming Lei <ming.lei@redhat.com>
To: Aaron Tomlin <atomlin@atomlin.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
mst@redhat.com, aacraid@microsemi.com,
James.Bottomley@hansenpartnership.com,
martin.petersen@oracle.com, liyihang9@h-partners.com,
kashyap.desai@broadcom.com, sumit.saxena@broadcom.com,
shivasharan.srikanteshwara@broadcom.com,
chandrakanth.patil@broadcom.com, sathya.prakash@broadcom.com,
sreekanth.reddy@broadcom.com,
suganath-prabu.subramani@broadcom.com, ranjan.kumar@broadcom.com,
jinpu.wang@cloud.ionos.com, tglx@kernel.org, mingo@redhat.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, akpm@linux-foundation.org,
maz@kernel.org, ruanjinjie@huawei.com, yphbchou0911@gmail.com,
wagi@kernel.org, frederic@kernel.org, longman@redhat.com,
chenridong@huawei.com, hare@suse.de, kch@nvidia.com,
steve@abita.co, sean@ashe.io, chjohnst@gmail.com, neelx@suse.com,
mproche@gmail.com, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org,
megaraidlinux.pdl@broadcom.com, mpi3mr-linuxdrv.pdl@broadcom.com,
MPT-FusionLinux.pdl@broadcom.com
Subject: Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
Date: Fri, 3 Apr 2026 09:20:17 +0800
Message-ID: <ac8V0RO-yg3juOox@fedora>
In-Reply-To: <sluplntvagevh6ehfm3kqinbh23d2gnin7stkptxk6drvogh2g@4hpz74fidrq2>
On Thu, Apr 02, 2026 at 08:50:55PM -0400, Aaron Tomlin wrote:
> On Thu, Apr 02, 2026 at 11:09:40AM +0200, Sebastian Andrzej Siewior wrote:
> > On 2026-04-01 16:58:22 [-0400], Aaron Tomlin wrote:
> > > Hi Sebastian,
> > Hi,
> >
> > > Thank you for taking the time to document the "managed_irq" behaviour; it
> > > is immensely helpful. You raise a highly pertinent point regarding the
> > > potential proliferation of "isolcpus=" flags. It is certainly a situation
> > > that must be managed carefully to prevent every subsystem from demanding
> > > its own bit.
> > >
> > > To clarify the reasoning behind introducing "io_queue" rather than strictly
> > > relying on managed_irq:
> > >
> > > The managed_irq flag belongs firmly to the interrupt subsystem. It dictates
> > > whether a CPU is eligible to receive hardware interrupts whose affinity is
> > > managed by the kernel. Whilst many modern block drivers use managed IRQs,
> > > the block layer multi-queue mapping encompasses far more than just
> > > interrupt routing. It maps logical queues to CPUs to handle I/O submission,
> > > software queues, and crucially, poll queues, which do not utilise
> > > interrupts at all. Furthermore, there are specific drivers that do not use
> > > the managed IRQ infrastructure but still rely on the block layer for queue
> > > distribution.
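[Concretely, the two flags would sit side by side on the kernel command line
along these lines; `io_queue` is the flag proposed in this series, and the
CPU range is only an example:]

```shell
# Existing flag: keep kernel-managed hard IRQs off CPUs 2-7, best effort.
isolcpus=managed_irq,2-7

# Proposed flag: additionally keep blk-mq queue mappings -- submission,
# software queues and poll queues -- off CPUs 2-7.
isolcpus=io_queue,2-7
```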
> >
> > Could you tell block which queue maps to which CPU at /sys/block/$$/mq/
> > level? Then you have one queue going to one CPU.
> > Then the driver could request one or more interrupts, managed or not. For
> > managed ones you could specify a CPU mask which you desire to occupy.
> > You have the case where
> > - you have more queues than CPUs
> > - use all of them
> > - use less
> > - less queues than CPUs
> > - mapped a queue to more than one CPU in case it goes down or becomes
> > unavailable
> > - mapped to one CPU
> >
> > Ideally you solve this at one level so that the device(s) can request
> > less queues than CPUs if told so without patching each and every driver.
> >
> > This should give you the freedom to isolate CPUs, decide at boot time
> > which CPUs get I/O queues assigned. At run time you can tell which
> > queues go to which CPUs. If you shutdown a queue, the interrupt remains
> > but does not get any I/O requests assigned so no problem. If the CPU
> > goes down, same thing.
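[For reference, the read-only half of this already exists in sysfs; a small
sketch to dump the current queue-to-CPU map — the helper name and the root
override are mine, for illustration, and the writable per-queue mapping
suggested above does not exist today:]

```python
from pathlib import Path

def print_mq_map(root="/sys/block"):
    """Return 'disk qN: cpus' lines for each blk-mq hardware context.

    root is overridable purely so the sketch can be exercised against a
    fake sysfs tree; on a real system the default is what you want.
    """
    lines = []
    for cpu_list in sorted(Path(root).glob("*/mq/*/cpu_list")):
        disk = cpu_list.parts[-4]          # .../<disk>/mq/<n>/cpu_list
        hctx = cpu_list.parts[-2]
        lines.append(f"{disk} q{hctx}: {cpu_list.read_text().strip()}")
    return lines

if __name__ == "__main__":
    print("\n".join(print_mq_map()))
```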
> >
> > I am trying to come up with a design here which I haven't found so far.
> > But I might be late to the party and everyone else is fully aware.
> >
> > > If managed_irq were solely relied upon, the IRQ subsystem would
> > > successfully keep hardware interrupts off the isolated CPUs, but the block
> >
> > The managed_irqs can't be influenced by userland. The CPUs are
> > auto-distributed.
> >
> > > layer would still blindly map polling queues or non-managed queues to those
> > > same isolated CPUs. This would force isolated CPUs to process I/O
> > > submissions or handle polling tasks, thereby breaking the strict isolation.
> > >
> > > Regarding the point about the networking subsystem, it is a very valid
> > > comparison. If the networking layer wishes to respect isolcpus in the
> > > future, adding a net flag would indeed exacerbate the bit proliferation.
> >
> > Networking could also have different cases like adding a RX filter and
> > having HW putting packet based on it in a dedicated queue. But also in
> > this case I would like to have the freedom to decide which isolated
> > CPUs should receive interrupts/traffic and which don't.
> >
> > > For the present time, retaining io_queue seems the most prudent approach to
> > > ensure that block queue mapping remains semantically distinct from
> > > interrupt delivery. This provides an immediate and clean architectural
> > > boundary. However, if the consensus amongst the maintainers suggests that
> > > this is too granular, alternative approaches could certainly be considered
> > > for the future. For instance, a broader, more generic flag could be
> > > introduced to encompass both block and future networking queue mappings.
> > > Alternatively, if semantic conflation is deemed acceptable, the existing
> > > managed_irq housekeeping mask could simply be overloaded within the block
> > > layer to restrict all queue mappings.
> > >
> > > Keeping the current separation appears to be the cleanest solution for this
> > > series, but your thoughts, and those of the wider community, on potentially
> > > migrating to a consolidated generic flag in the future would be very much
> > > welcomed.
> >
> > I just don't like introducing yet another boot argument, making it a
> > boot constraint, while in my naive view this could be managed to some
> > degree via sysfs as suggested above.
>
> Hi Sebastian,
>
> I believe it would be more prudent to defer to Thomas Gleixner and Jens
> Axboe on this matter.
>
> Indeed, I am entirely sympathetic to your reluctance to introduce yet
> another boot parameter, and I concur that run-time configurability
> represents the ideal scenario for system tuning.
`io_queue` introduces the cost that offlining a CPU can potentially fail, so
how can it replace the existing `managed_irq`?
>
> At present, a device such as an NVMe controller allocates its hardware
> queues and requests its interrupt vectors during the initial device probe
> phase. The block layer calculates the optimal queue to CPU mapping based on
> the system topology at that precise moment. Altering this mapping
> dynamically at runtime via sysfs would be an exceptionally intricate
> undertaking. It would necessitate freezing all active operations, tearing
> down the physical hardware queues on the device, renegotiating the
> interrupt vectors with the peripheral component interconnect subsystem, and
> finally reconstructing the entire queue map.
>
> Furthermore, the proposed io_queue boot parameter successfully achieves the
> objective of avoiding driver level modifications. By applying the
> housekeeping mask constraint centrally within the core block layer mapping
> helpers, all multiqueue drivers automatically inherit the CPU isolation
> boundaries without requiring a single line of code to be changed within the
> individual drivers themselves.
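The central-helper argument can be illustrated with a toy model outside the
kernel (names are mine, not actual blk-mq symbols): if the mapping function
itself only hands queue ownership to housekeeping CPUs, every caller inherits
the isolation boundary, assuming isolated CPUs simply reuse a housekeeping
CPU's queue:

```python
def map_queues(nr_cpus, nr_queues, isolated=frozenset()):
    """Toy queue map: index = CPU, value = hardware queue serving it.

    Queues are spread evenly over housekeeping CPUs only, so no queue is
    "owned" by an isolated CPU; isolated CPUs are still mapped somewhere,
    since I/O submitted there has to land on some queue.
    """
    housekeeping = [c for c in range(nr_cpus) if c not in isolated]
    if not housekeeping:                 # degenerate case: nothing left
        housekeeping = list(range(nr_cpus))
    qmap = [0] * nr_cpus
    # Spread queues evenly across the housekeeping CPUs first ...
    for i, cpu in enumerate(housekeeping):
        qmap[cpu] = i * nr_queues // len(housekeeping)
    # ... then point each isolated CPU at a housekeeping CPU's queue.
    for i, cpu in enumerate(sorted(set(isolated) & set(range(nr_cpus)))):
        qmap[cpu] = qmap[housekeeping[i % len(housekeeping)]]
    return qmap
```

With 8 CPUs, 2 queues and CPUs 2-3 and 6-7 isolated, this yields
[0, 0, 0, 0, 1, 1, 1, 1]: both queues end up homed on housekeeping CPUs
0-1 and 4-5.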
>
> Because the hardware queue count and CPU alignment must be calculated as
> the device initialises, a reliable mechanism is required to inform the
> block layer of which CPUs are strictly isolated before the probe sequence
> commences. This is precisely why integrating with the existing boot time
> housekeeping infrastructure is currently the most viable and robust
> solution.
>
> Whilst a fully dynamic sysfs-driven reconfiguration architecture would be
> great, it would represent a substantial paradigm shift for the block layer.
> For the present time, the io_queue flag resolves the immediate and severe
> latency issues experienced by users with isolated CPUs, employing an
> established and safe methodology.
I'd suggest documenting the exact existing problem, because `managed_irq`
should already cover it in a best-effort way; then people can know how to
choose between the two parameters.
Thanks,
Ming
Thread overview: 29+ messages
2026-03-30 22:10 [PATCH v9 00/13] blk: honor isolcpus configuration Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 01/13] scsi: aacraid: use block layer helpers to calculate num of queues Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 02/13] lib/group_cpus: remove dead !SMP code Aaron Tomlin
2026-04-01 12:29 ` Sebastian Andrzej Siewior
2026-04-01 19:31 ` Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 03/13] lib/group_cpus: Add group_mask_cpus_evenly() Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 04/13] genirq/affinity: Add cpumask to struct irq_affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 05/13] blk-mq: add blk_mq_{online|possible}_queue_affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 06/13] nvme-pci: use block layer helpers to constrain queue affinity Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 07/13] scsi: Use " Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 08/13] virtio: blk/scsi: use " Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type Aaron Tomlin
2026-04-01 12:49 ` Sebastian Andrzej Siewior
2026-04-01 19:05 ` Waiman Long
2026-04-02 7:58 ` Sebastian Andrzej Siewior
2026-04-03 1:54 ` Waiman Long
2026-04-01 20:58 ` Aaron Tomlin
2026-04-02 9:09 ` Sebastian Andrzej Siewior
2026-04-03 0:50 ` Aaron Tomlin
2026-04-03 1:20 ` Ming Lei [this message]
2026-03-30 22:10 ` [PATCH v9 10/13] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Aaron Tomlin
2026-03-31 23:05 ` Keith Busch
2026-04-01 17:16 ` Aaron Tomlin
2026-04-03 1:43 ` Ming Lei
2026-03-30 22:10 ` [PATCH v9 11/13] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 12/13] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs Aaron Tomlin
2026-03-30 22:10 ` [PATCH v9 13/13] docs: add io_queue flag to isolcpus Aaron Tomlin
2026-03-31 1:01 ` [PATCH v9 00/13] blk: honor isolcpus configuration Ming Lei
2026-03-31 13:38 ` Aaron Tomlin